NEW BERRY-ESSEEN AND WASSERSTEIN BOUNDS IN THE
CLT FOR NON-RANDOMLY CENTERED RANDOM SUMS BY
PROBABILISTIC METHODS


CHRISTIAN DOBLER 


Abstract. We prove abstract bounds on the Wasserstein and Kolmogorov distances
between non-randomly centered random sums of real i.i.d. random variables
with a finite third moment and the standard normal distribution. Except for the
case of mean zero summands, these bounds involve a coupling of the summation
index with its size biased distribution, as was previously considered in [GR96] for
the normal approximation of nonnegative random variables. When specialized
to concrete distributions of the summation index like the Binomial, Poisson
and Hypergeometric distributions, our bounds turn out to be of the correct order
of magnitude.


1. Introduction 


Let $N, X_1, X_2, \dots$ be random variables on a common probability space such that
the $X_j$, $j \geq 1$, are real-valued and $N$ assumes values in the set of nonnegative integers
$\mathbb{Z}_+ = \{0, 1, \dots\}$. Then, the random variable

$$S := \sum_{j=1}^{N} X_j \tag{1.1}$$


is called a random sum. Such random variables appear frequently in modern
probability theory, as many models, for example from physics, finance, reliability and risk
theory, naturally lead to the consideration of such sums. Furthermore, sometimes a
model which looks quite different from (1.1) at the outset may be transformed into
a random sum, and then the general theory of such sums may be invoked to study the
original model [GK96]. For example, by the recent so-called master Steiner formula
from [MT14], the distribution of the metric projection of a standard Gaussian vector
onto a closed convex cone in Euclidean space can be represented as a random sum
of i.i.d. centered chi-squared random variables with the distribution of $N$ given by
the conic intrinsic volumes of the cone. Hence, this distribution belongs to the class
of the so-called chi-bar-square distributions, which is ubiquitous in the theory of
hypothesis testing with inequality constraints (see e.g. [Dyk91] and [Sha88]). This
representation was used in [GNP14] to prove quantitative CLTs for both the
distribution of the metric projection and the conic intrinsic volume distribution. These
results are of interest e.g. in the field of compressed sensing.

There already exists a huge body of literature about the asymptotic distributions of
random sums. Their investigation evidently began with the work [Rob48] of Robbins,
who assumes that the random variables $X_1, X_2, \dots$ are i.i.d. with a finite second
moment and that $N$ also has a finite second moment. One of the results of [Rob48] is
that under these assumptions asymptotic normality of the index $N$ automatically
implies asymptotic normality of the corresponding random sum. The book [GK96]
gives a comprehensive description of the limiting behaviour of such random sums
under the assumption that the random variables $N, X_1, X_2, \dots$ are independent. In
particular, one may ask under what conditions the sum $S$ in (1.1) is asymptotically
normal, where "asymptotically" refers to the fact that the random index $N$ usually
depends on a parameter, which is sent either to infinity or to zero. Once a CLT
is known to hold, one might ask about the accuracy of the normal approximation
to the distribution of the given random sum. It turns out that it is generally much
easier to derive rates of convergence for random sums of centered random variables,
or, which amounts to the same thing, for random sums centered by random variables,
than for random sums of not necessarily centered random variables. In the centered
case one might, for instance, first condition on the value of the index $N$, then use
known error bounds for sums of a fixed number of independent random variables,
like the classical Berry-Esseen theorem, and, finally, take the expectation with respect
to $N$. This technique is illustrated e.g. in the manuscript [Dob12] and also works
for non-normal limiting distributions like the Laplace distribution. For this reason
we will mainly be interested in deriving sharp rates of convergence for the case of
non-centered summands, but will also consider the mean-zero case and hint at the
relevant differences. Also, we will not assume from the outset that the index $N$ has
a certain fixed distribution like the Binomial or the Poisson, but will be interested
in the general situation.

For non-centered summands and general index $N$, the relevant literature on rates of
convergence in the random sums CLT seems quite easy to survey. Under the same
assumptions as in [Rob48], the paper [Eng83] gives an upper bound on the Kolmogorov
distance between the distribution of the random sum and a suitable normal
distribution, which is proved to be sharp in some sense. However, this bound is not very
explicit: for instance, one of its terms is the Kolmogorov distance of $N$ to the normal
distribution with the same mean and variance as $N$.
This might make it difficult to apply this result to a concrete distribution
of $N$. Furthermore, the method of proof cannot be easily adapted to probability
metrics different from the Kolmogorov distance, like e.g. the Wasserstein distance.
In [Kor87] a bound on the Kolmogorov distance is given which improves upon the
result of [Eng83] with respect to the constants appearing in the bound. However,
the bound given in [Kor87] is no longer strong enough to assure the well-known
asymptotic normality of Binomial and Poisson random sums, unless the summands are
centered. The paper [Kor88] generalizes the results from [Eng83] to the case of not
necessarily identically distributed summands and to situations where the summands
might not have finite absolute third moments. However, at least for non-centered
summands, the bounds in [Kor88] still lack some explicitness.

To the best of our knowledge, the article [Sun13] is the only one which gives bounds
on the Wasserstein distance between random sums for general indices $N$ and the
standard normal distribution. However, as mentioned by the same author in [Sun14], the
results of [Sun13] generally do not yield accurate bounds, unless the summands are
centered. Indeed, the results from [Sun13] do not even yield convergence in
distribution for Binomial or Poisson random sums of non-centered summands.

The main purpose of the present article is to combine Stein's method of normal
approximation with several modern probabilistic concepts, like certain coupling
constructions and conditional independence, to prove accurate abstract upper bounds
on the distance between suitably standardized random sums of i.i.d. summands,
measured by two popular probability metrics, the Kolmogorov and Wasserstein distances.
Using a simple inequality, this gives bounds for the whole class of $L^p$ distances of
distributions, $1 \leq p < \infty$. These upper bounds, in their most abstract forms (see
Theorem 2.5 below), involve moments of the difference of a coupling of $N$ with
its size-biased distribution but reduce to very explicit expressions if either $N$ has
a concrete distribution like the Binomial, Poisson or Dirac delta distribution, the
summands $X_j$ are centered, or the distribution of $N$ is infinitely divisible. These
special cases are extensively presented in order to illustrate the wide applicability
and strength of our results. As indicated above, this seems to be the first work which
gives Wasserstein bounds in the random sums CLT for general indices $N$ that
reduce to bounds of optimal order when specialized to concrete distributions
like the Binomial and the Poisson distributions. Using our abstract approach via
size-bias couplings, we are also able to prove rates for Hypergeometric random sums.
These do not seem to have been treated in the literature yet. This is not a surprise,
because the Hypergeometric distribution is conceptually more complicated than the
Binomial or Poisson distribution, as it is neither a natural convolution of i.i.d.
random variables nor infinitely divisible. Indeed, every distribution of the summation
index which allows for a close size-bias coupling should be amenable to our approach.
It should be mentioned that Stein's method and coupling techniques have previously
been used to bound the error of exponential approximation [PR11] and of
approximation by the Laplace distribution [PR14] of certain random sums. In these papers,
the authors make use of the fact that the exponential distribution and the Laplace
distribution are the unique fixed points of certain distributional transformations and
are able to successfully couple the given random sum with a random variable having
the respective transformed distribution. In the case of the standard normal
distribution, which is a fixed point of the zero-bias transformation from [GR97], it appears
natural to try to construct a close coupling with the zero biased distribution of the
random sum under consideration. However, interestingly, it turns out that we are
only able to do so in the case of centered summands, whereas for the general case an
intermediate step involving a coupling of the index $N$ with its size biased
distribution is required for the proof. Nevertheless, the zero-bias transformation, or rather
an extension of it to non-centered random variables, plays an important role in our
argument. This combination of two coupling constructions which belong to the
classical tools of Stein's method for normal approximation is a new feature lying at the
heart of our approach.

The remainder of the article is structured as follows: In Section 2 we review the
relevant probability distances and the size biased distribution, and state our quantitative
results on the normal approximation of random sums. Furthermore, we prove new
identities for the distance of a nonnegative random variable to its size-biased
distribution in three prominent metrics and show that, for some concrete distributions,
natural couplings are $L^1$-optimal and, hence, yield the Wasserstein distance. In
Section 3 we collect necessary facts from Stein's method of normal approximation and
introduce a variant of the zero-bias transformation, which we need for the proofs of
our results. Then, in Section 4 the proofs of our main theorems, Theorem 2.5 and
Theorem 2.7, are given. Finally, Section 5 contains the proofs of some auxiliary results
needed for the proof of the Berry-Esseen bounds in Section 4.


2. Main results 


Recall that for probability measures $\mu$ and $\nu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, their Kolmogorov
distance is defined by

$$d_K(\mu, \nu) := \sup_{z \in \mathbb{R}} \bigl|\mu((-\infty, z]) - \nu((-\infty, z])\bigr| = \|F - G\|_\infty,$$

where $F$ and $G$ are the distribution functions corresponding to $\mu$ and $\nu$,
respectively. Also, if both $\mu$ and $\nu$ have finite first absolute moment, then one defines the
Wasserstein distance between them via

$$d_W(\mu, \nu) := \sup_{h \in \mathrm{Lip}(1)} \Bigl|\int h \, d\mu - \int h \, d\nu\Bigr|,$$

where $\mathrm{Lip}(1)$ denotes the class of all Lipschitz-continuous functions $h$ on $\mathbb{R}$ with
Lipschitz constant not greater than 1. In view of Lemma 2.1 below, we also introduce
the total variation distance between $\mu$ and $\nu$ by

$$d_{TV}(\mu, \nu) := \sup_{B \in \mathcal{B}(\mathbb{R})} |\mu(B) - \nu(B)|.$$

If the real-valued random variables $X$ and $Y$ have distributions $\mu$ and $\nu$, respectively,
then we simply write $d_K(X, Y)$ for $d_K(\mathcal{L}(X), \mathcal{L}(Y))$, and similarly for the Wasserstein
and total variation distances, and also speak of the respective distance between the
random variables $X$ and $Y$. Before stating our results, we have to review the concept
of the size-biased distribution corresponding to a distribution supported on $[0, \infty)$.
Thus, if $X$ is a nonnegative random variable with $0 < E[X] < \infty$, then a random
variable $X^s$ is said to have the $X$-size biased distribution if, for all bounded and
measurable functions $h$ on $[0, \infty)$,


$$E[X h(X)] = E[X] \, E[h(X^s)], \tag{2.1}$$

see, e.g. [GR96], [AG10] or [AGK13]. Equivalently, the distribution of $X^s$ has
Radon-Nikodym derivative with respect to the distribution of $X$ given by

$$\frac{P(X^s \in dx)}{P(X \in dx)} = \frac{x}{E[X]},$$

which immediately implies both existence and uniqueness of the $X$-size biased
distribution. Also note that (2.1) holds true for all measurable functions $h$ for which
$E|X h(X)| < \infty$. In consequence, if $X \in L^p(P)$ for some $1 \leq p < \infty$, then
$X^s \in L^{p-1}(P)$ and

$$E\bigl[(X^s)^{p-1}\bigr] = \frac{E[X^p]}{E[X]}.$$


The following lemma, which seems to be new and might be of independent interest,
gives identities for the distance of $X$ to $X^s$ in the three metrics mentioned above.
The proof is deferred to the end of this section.


Lemma 2.1. Let $X$ be a nonnegative random variable such that $0 < E[X] < \infty$.
Then, the following identities hold true:

(a) $d_K(X, X^s) = d_{TV}(X, X^s) = \dfrac{E|X - E[X]|}{2 E[X]}$.

(b) If additionally $E[X^2] < \infty$, then $d_W(X, X^s) = \dfrac{\mathrm{Var}(X)}{E[X]}$.
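The identities of Lemma 2.1 can be checked exactly on a small example. The following is a minimal sketch, assuming a hypothetical four-point distribution chosen only for illustration; the helper names (`size_bias`, `kolmogorov`, `total_variation`, `wasserstein`) are ours and not part of the paper. It uses exact rational arithmetic and the fact that, on the real line, the Wasserstein distance equals the integral of the absolute difference of the distribution functions.

```python
from fractions import Fraction


def size_bias(pmf):
    """pmf of the size-biased law: P(X^s = x) = x * P(X = x) / E[X]."""
    mean = sum(x * p for x, p in pmf.items())
    return {x: x * p / mean for x, p in pmf.items() if x > 0}


def kolmogorov(p, q):
    """sup_z |F_p(z) - F_q(z)|; for step functions the sup is attained at an atom."""
    fp = fq = best = Fraction(0)
    for x in sorted(set(p) | set(q)):
        fp += p.get(x, 0)
        fq += q.get(x, 0)
        best = max(best, abs(fp - fq))
    return best


def total_variation(p, q):
    """Half the l^1 distance between the two pmfs."""
    return sum(abs(p.get(x, 0) - q.get(x, 0)) for x in set(p) | set(q)) / 2


def wasserstein(p, q):
    """Integral of |F_p - F_q| over the real line."""
    support = sorted(set(p) | set(q))
    fp = fq = total = Fraction(0)
    for left, right in zip(support, support[1:]):
        fp += p.get(left, 0)
        fq += q.get(left, 0)
        total += abs(fp - fq) * (right - left)
    return total


# Hypothetical example distribution on {0, 1, 2, 5} (an assumption for illustration).
pmf = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(3, 10), 5: Fraction(1, 5)}
pmf_s = size_bias(pmf)

mean = sum(x * p for x, p in pmf.items())
var = sum(x * x * p for x, p in pmf.items()) - mean ** 2
mad = sum(abs(x - mean) * p for x, p in pmf.items())  # E|X - E[X]|

d_k = kolmogorov(pmf, pmf_s)
d_tv = total_variation(pmf, pmf_s)
d_w = wasserstein(pmf, pmf_s)
```

For this example, `d_k` and `d_tv` both agree with `mad / (2 * mean)` and `d_w` agrees with `var / mean`, as the lemma predicts.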


Remark 2.2. (a) It is well known (see e.g. [Dud02]) that the Wasserstein distance
$d_W(X, Y)$ between the real random variables $X$ and $Y$ has the dual representation

$$d_W(X, Y) = \inf_{(\hat X, \hat Y) \in \pi(X, Y)} E|\hat X - \hat Y|, \tag{2.2}$$

where $\pi(X, Y)$ is the collection of all couplings of $X$ and $Y$, i.e. of all pairs
$(\hat X, \hat Y)$ of random variables on a joint probability space such that
$\hat X \stackrel{\mathcal{D}}{=} X$ and $\hat Y \stackrel{\mathcal{D}}{=} Y$. Also, the infimum in (2.2) is always attained, e.g. by the quantile
transformation: If $U$ is uniformly distributed on $(0, 1)$ and if, for a distribution
function $F$ on $\mathbb{R}$, we let

$$F^{-1}(p) := \inf\{x \in \mathbb{R} : F(x) \geq p\}, \quad p \in (0, 1),$$

denote the corresponding generalized inverse of $F$, then $F^{-1}(U)$ is a random
variable with distribution function $F$. Thus, letting $F_X$ and $F_Y$ denote the
distribution functions of $X$ and $Y$, respectively, it was proved e.g. in [Maj78]
that

$$\inf_{(\hat X, \hat Y) \in \pi(X, Y)} E|\hat X - \hat Y| = E\bigl|F_X^{-1}(U) - F_Y^{-1}(U)\bigr| = \int_0^1 \bigl|F_X^{-1}(t) - F_Y^{-1}(t)\bigr| \, dt.$$

Furthermore, it is not difficult to see that $X^s$ is always stochastically larger than
$X$, implying that there is a coupling $(\hat X, \hat X^s)$ of $X$ and $X^s$ such that $\hat X^s \geq \hat X$
(see [AG10] for details). In fact, this property is already achieved by the coupling
via the quantile transformation. By the dual representation (2.2) and the fact
that the coupling via the quantile transformation yields the minimum $L^1$ distance
in (2.2), we can conclude that every coupling $(\hat X, \hat X^s)$ such that $\hat X^s \geq \hat X$ is optimal
in this sense, since

$$E|\hat X^s - \hat X| = E[\hat X^s] - E[\hat X] = E[F_{X^s}^{-1}(U)] - E[F_X^{-1}(U)] = E\bigl|F_{X^s}^{-1}(U) - F_X^{-1}(U)\bigr| = d_W(X, X^s).$$

Note also that, by the last computation and by part (b) of Lemma 2.1, we have

$$E[\hat X^s] - E[\hat X] = E[X^s] - E[X] = d_W(X, X^s) = \frac{\mathrm{Var}(X)}{E[X]}.$$

(b) Due to a result by Steutel [Ste73], the distribution of $X$ is infinitely divisible
if and only if there exists a coupling $(X, X^s)$ of $X$ and $X^s$ such that $X^s - X$
is nonnegative and independent of $X$ (see e.g. [AG10] for a nice exposition and
a proof of this result). According to (a), such a coupling always achieves the
minimum $L^1$-distance.

(c) It might seem curious that, according to part (a) of Lemma 2.1, the Kolmogorov
distance and the total variation distance between a nonnegative random variable
and one with its size biased distribution always coincide. Indeed, this holds true
since for each Borel-measurable set $B \subseteq \mathbb{R}$ we have the inequality

$$|P(X^s \in B) - P(X \in B)| \leq |P(X^s > m) - P(X > m)| \leq d_K(X, X^s),$$






















where $m := E[X]$. Thus, the supremum in the definition

$$d_{TV}(X, X^s) = \sup_{B \in \mathcal{B}(\mathbb{R})} |P(X^s \in B) - P(X \in B)|$$

of the total variation distance is attained for the set $B = (m, \infty)$. This can be
briefly proved and explained in the following way: For $t \in \mathbb{R}$, using the defining
property (2.1) of the size biased distribution, we can write

$$H(t) := P(X^s \leq t) - P(X \leq t) = m^{-1} E\bigl[(X - m) 1_{\{X \leq t\}}\bigr].$$

Thus, for $s < t$ we have

$$H(t) - H(s) = m^{-1} E\bigl[(X - m) 1_{\{s < X \leq t\}}\bigr],$$

and, hence, $H$ is decreasing on $(-\infty, m)$ and increasing on $(m, \infty)$. Thus, for
every Borel set $B \subseteq \mathbb{R}$ we conclude that

$$P(X^s \in B) - P(X \in B) = \int_{\mathbb{R}} 1_B(t) \, dH(t) \leq \int_{\mathbb{R}} 1_{B \cap (m, \infty)}(t) \, dH(t) \leq \int_{\mathbb{R}} 1_{(m, \infty)}(t) \, dH(t) = P(X^s > m) - P(X > m).$$

Note that for this argument we heavily relied on the defining property (2.1)
of the size biased distribution, which guaranteed the monotonicity property of the
difference $H$ of the distribution functions of $X^s$ and $X$, respectively. Since $X^s$ is
stochastically larger than $X$, one might suspect that the coincidence of the total
variation and the Kolmogorov distance holds true in this more general situation.
However, observe that the fact that $X^s$ dominates $X$ stochastically only implies
that $H \leq 0$, but that it is the monotonicity of $H$ on $(-\infty, m)$ and on $(m, \infty)$
that was crucial for the derivation.
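The quantile coupling from part (a) of the remark can be made concrete for distributions with finite support. The sketch below is only an illustration under our own assumptions: the example law $\mathrm{Bin}(3, 1/2)$ and the helper names are hypothetical choices. It computes $E|F_X^{-1}(U) - F_{X^s}^{-1}(U)|$ exactly by splitting $(0, 1)$ at the jump levels of the two distribution functions, on which both generalized inverses are constant.

```python
from fractions import Fraction
from math import comb


def cdf_steps(pmf):
    """Sorted list of (x, F(x)) pairs for a pmf with finite support."""
    acc = Fraction(0)
    steps = []
    for x in sorted(pmf):
        acc += pmf[x]
        steps.append((x, acc))
    return steps


def quantile(steps, u):
    """Generalized inverse F^{-1}(u) = inf{x : F(x) >= u} for u in (0, 1)."""
    for x, level in steps:
        if level >= u:
            return x
    raise ValueError("u must lie in (0, 1)")


def quantile_coupling_cost(pmf_x, pmf_y):
    """E|F_X^{-1}(U) - F_Y^{-1}(U)| for U uniform on (0, 1), computed exactly."""
    sx, sy = cdf_steps(pmf_x), cdf_steps(pmf_y)
    cuts = sorted({Fraction(0)} | {lvl for _, lvl in sx} | {lvl for _, lvl in sy})
    cost = Fraction(0)
    for lo, hi in zip(cuts, cuts[1:]):
        mid = (lo + hi) / 2  # both quantile functions are constant on (lo, hi]
        cost += (hi - lo) * abs(quantile(sx, mid) - quantile(sy, mid))
    return cost


# Hypothetical example: X ~ Bin(3, 1/2) and its size-biased law.
n, p = 3, Fraction(1, 2)
pmf_x = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}
mean = sum(k * w for k, w in pmf_x.items())
variance = sum(k * k * w for k, w in pmf_x.items()) - mean**2
pmf_s = {k: k * w / mean for k, w in pmf_x.items() if k > 0}

cost = quantile_coupling_cost(pmf_x, pmf_s)
```

In this example `cost` equals `variance / mean`, and the quantile function of the size-biased law dominates that of $X$ pointwise, which is the monotone coupling described in the remark.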

Example 2.3. (a) Let $X \sim \mathrm{Poisson}(\lambda)$ have the Poisson distribution with parameter
$\lambda > 0$. From the Stein characterization of $\mathrm{Poisson}(\lambda)$ (see [Che75]) it is known
that

$$E[X f(X)] = \lambda E[f(X + 1)] = E[X] E[f(X + 1)]$$

for all bounded and measurable $f$. Hence, $X + 1$ has the $X$-size biased
distribution. As $X + 1 \geq X$, by Remark 2.2 (a) this coupling yields the minimum $L^1$-distance
between $X$ and $X^s$, which is equal to 1 in this case.

(b) Let $n$ be a positive integer, $p \in (0, 1]$ and let $X_1, \dots, X_n$ be i.i.d. random variables
such that $X_1 \sim \mathrm{Bernoulli}(p)$. Then,

$$X := \sum_{j=1}^{n} X_j \sim \mathrm{Bin}(n, p)$$

has the Binomial distribution with parameters $n$ and $p$. From the construction
in [GR96] one easily sees that

$$X^s := \sum_{j=2}^{n} X_j + 1$$

has the $X$-size biased distribution. As $X^s \geq X$, by Remark 2.2 (a) this coupling
yields the minimum $L^1$-distance between $X$ and $X^s$, which is equal to

$$d_W(X, X^s) = E[1 - X_1] = 1 - p = \frac{\mathrm{Var}(X)}{E[X]},$$

in accordance with Lemma 2.1.

(c) Let $n, r, s$ be positive integers such that $n \leq r + s$ and let $X \sim \mathrm{Hyp}(n; r, s)$ have
the Hypergeometric distribution with parameters $n$, $r$ and $s$, i.e.

$$P(X = k) = \frac{\binom{r}{k} \binom{s}{n-k}}{\binom{r+s}{n}}, \quad k = 0, 1, \dots, n,$$

with $E[X] = \frac{nr}{r+s}$. Imagine an urn with $r$ red and $s$ silver balls. If we draw
$n$ times without replacement from this urn and denote by $X$ the total number
of drawn red balls, then $X \sim \mathrm{Hyp}(n; r, s)$. For $j = 1, \dots, n$ denote by $X_j$
the indicator of the event that a red ball is drawn at the $j$-th draw. Then,
$X = \sum_{j=1}^{n} X_j$ and, since the $X_j$ are exchangeable, the well-known construction
of a random variable $X^s$ with the $X$-size biased distribution from [GR96] gives
that $X^s = 1 + \sum_{j=2}^{n} X_j'$, where

$$\mathcal{L}\bigl((X_2', \dots, X_n')\bigr) = \mathcal{L}\bigl((X_2, \dots, X_n) \mid X_1 = 1\bigr).$$

But given $X_1 = 1$ the sum $\sum_{j=2}^{n} X_j$ has the Hypergeometric distribution with
parameters $n - 1$, $r - 1$ and $s$ and, hence,

$$X^s \stackrel{\mathcal{D}}{=} Y + 1, \quad \text{where } Y \sim \mathrm{Hyp}(n - 1; r - 1, s).$$

In order to construct an $L^1$-optimal coupling of $X$ and $X^s$, fix one of the red
balls in the urn and, for $j = 2, \dots, n$, denote by $Y_j$ the indicator of the event
that at the $j$-th draw this fixed red ball is drawn. Then, it is not difficult to see
that

$$Y := 1_{\{X_1 = 1\}} \sum_{j=2}^{n} X_j + 1_{\{X_1 = 0\}} \sum_{j=2}^{n} (X_j - Y_j) \sim \mathrm{Hyp}(n - 1; r - 1, s)$$

and, hence,

$$X^s := Y + 1 = 1_{\{X_1 = 1\}} \sum_{j=2}^{n} X_j + 1_{\{X_1 = 0\}} \sum_{j=2}^{n} (X_j - Y_j) + 1
= 1_{\{X_1 = 1\}} X + 1_{\{X_1 = 0\}} \Bigl(X + 1 - \sum_{j=2}^{n} Y_j\Bigr)$$

has the $X$-size biased distribution. Note that since $\sum_{j=2}^{n} Y_j \leq 1$ we have

$$X^s - X = 1_{\{X_1 = 0\}} \Bigl(1 - \sum_{j=2}^{n} Y_j\Bigr) \geq 0,$$

and consequently, by Remark 2.2 (a), the coupling $(X, X^s)$ is optimal in the
$L^1$-sense and yields the Wasserstein distance between $X$ and $X^s$:

$$d_W(X, X^s) = E|X^s - X| = \frac{\mathrm{Var}(X)}{E[X]} = \frac{s(r + s - n)}{(r + s)(r + s - 1)}.$$

We now turn back to the asymptotic behaviour of random sums. We will rely on
the following general assumptions and notation, which we adopt and extend from
[Rob48].

Assumption 2.4. The random variables $N, X_1, X_2, \dots$ are independent, $X_1, X_2, \dots$
being i.i.d. and such that $E|X_1|^3 < \infty$ and $E[N^3] < \infty$. Furthermore, we let

$$\alpha := E[N], \quad \beta^2 := E[N^2], \quad \gamma^2 := \mathrm{Var}(N) = \beta^2 - \alpha^2, \quad \delta^3 := E[N^3],$$

$$a := E[X_1], \quad b^2 := E[X_1^2], \quad c^2 := \mathrm{Var}(X_1) = b^2 - a^2 \quad \text{and} \quad d^3 := E\bigl|X_1 - E[X_1]\bigr|^3.$$

By Wald's equation and the Blackwell-Girshick formula, from Assumption 2.4 we
have

$$\mu := E[S] = \alpha a \quad \text{and} \quad \sigma^2 := \mathrm{Var}(S) = \alpha c^2 + a^2 \gamma^2. \tag{2.3}$$
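The two identities in (2.3) can be confirmed by computing the exact law of a small random sum. The sketch below makes its own illustrative assumptions: $N \sim \mathrm{Bin}(4, 1/3)$ and non-centered summands taking the values $-1$ and $2$ with probabilities $1/4$ and $3/4$; the helper names are hypothetical. The law of $S$ is obtained by mixing $k$-fold convolutions, and its mean and variance are compared with $\alpha a$ and $\alpha c^2 + a^2 \gamma^2$.

```python
from fractions import Fraction
from math import comb


def convolve(p, q):
    """pmf of the sum of two independent variables with pmfs p and q."""
    out = {}
    for x, px in p.items():
        for y, py in q.items():
            out[x + y] = out.get(x + y, 0) + px * py
    return out


def random_sum_pmf(pmf_n, pmf_x):
    """Exact pmf of S = X_1 + ... + X_N for independent N and i.i.d. X_j."""
    result = {}
    k_fold = {0: Fraction(1)}  # law of the empty sum (N = 0)
    for k in range(max(pmf_n) + 1):
        weight = pmf_n.get(k, 0)
        for value, prob in k_fold.items():
            result[value] = result.get(value, 0) + weight * prob
        k_fold = convolve(k_fold, pmf_x)  # law of X_1 + ... + X_{k+1}
    return result


def mean_and_variance(pmf):
    mean = sum(x * p for x, p in pmf.items())
    return mean, sum(x * x * p for x, p in pmf.items()) - mean**2


# Hypothetical index and summand distributions (assumptions for illustration).
pmf_n = {k: comb(4, k) * Fraction(1, 3) ** k * Fraction(2, 3) ** (4 - k) for k in range(5)}
pmf_x = {-1: Fraction(1, 4), 2: Fraction(3, 4)}

alpha, gamma2 = mean_and_variance(pmf_n)
a, c2 = mean_and_variance(pmf_x)
pmf_s = random_sum_pmf(pmf_n, pmf_x)
mu, sigma2 = mean_and_variance(pmf_s)
```

Here `mu` coincides with `alpha * a` and `sigma2` with `alpha * c2 + a**2 * gamma2`, in exact rational arithmetic.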


The main purpose of this paper is to assess the accuracy of the standard normal
approximation to the normalized version

$$W := \frac{S - \mu}{\sigma} = \frac{S - \alpha a}{\sqrt{\alpha c^2 + a^2 \gamma^2}} \tag{2.4}$$

of $S$ measured by the Kolmogorov and the Wasserstein distance, respectively. As
can be seen from the paper [Rob48], under the general assumption that

$$\sigma^2 = \alpha c^2 + a^2 \gamma^2 \to \infty,$$


there are three typical situations in which $W$ is asymptotically normal, which we will
now briefly review.

1) $c \neq 0 \neq a$ and $\gamma^2 = o(\alpha)$

2) $a = 0 \neq c$ and $\gamma = o(\alpha)$

3) $N$ itself is asymptotically normal and at least one of $a$ and $c$ is different from zero.

We remark that 1) roughly means that $N$ tends to infinity in a certain sense, but
such that it only fluctuates slightly around its mean $\alpha$ and, thus, behaves more or
less as the constant $\alpha$ (tending to infinity). If $c = 0$ and $a \neq 0$, then we have

$$S = aN \quad \text{a.s.}$$

and asymptotic normality of $S$ is equivalent to that of $N$. For this reason, unless
specifically stated otherwise, we will from now on assume that $c \neq 0$. However, we
would like to remark that all bounds in which $c$ does not appear in the denominator
also hold true in the case $c = 0$.


Theorem 2.5. Suppose that Assumption 2.4 holds, let $W$ be given by (2.4) and let $Z$
have the standard normal distribution. Also, let $(N, N^s)$ be a coupling of $N$ and $N^s$
having the $N$-size biased distribution such that $N^s$ is also independent of $X_1, X_2, \dots$
and define $D := N^s - N$. Then, we have the following bounds:

(a) $$d_W(W, Z) \leq \frac{2 c^2 b \gamma^2}{\sigma^3} + \frac{3 \alpha d^3}{\sigma^3} + \frac{\alpha a^2}{\sigma^2} \sqrt{\frac{2}{\pi}} \sqrt{\mathrm{Var}(E[D \mid N])} + \frac{2 \alpha a^2 b}{\sigma^3} E\bigl[1_{\{D < 0\}} D^2\bigr] + \frac{\alpha |a| b^2}{\sigma^3} E[D^2].$$

(b) If, additionally, $D \geq 0$, then we also have

d K( w, z) < + #a(3V ^ +4) + 


4cr 3 

7 -Vi+ 2)^f 


8cr 3 


o° 


Ol u (j' 

+ — P[N=0)+ —E[N-^1 {n> _ 1) ) 


d 5 a 


V 


CO*■ 


+ 


era 


o L 


Var (E[D \ N]) + 


a a 


\h 2 


2 o 3 


E (E[D 2 \N]y 


+ 


a\a 




+ 


a\a 


| h 2 


co- 


'y/2TT 


E[D\ n> _ 1 } N-^] + + ^=)e[D1 {n> _ 1 } N-^] . 


Remark 2.6. (a) In many concrete situations, a natural coupling of
$N$ and $N^s$ yields $D \geq 0$ and, hence, Theorem 2.5 gives bounds on both the
Wasserstein and Kolmogorov distances (note that the fourth summand in the
bound on $d_W(W, Z)$ vanishes if $D \geq 0$). For instance, by Remark 2.2 (b), this is
the case if the distribution of $N$ is infinitely divisible. In this case, the random
variables $D$ and $N$ can be chosen to be independent and, thus, our bounds
can further be simplified (see Corollary 2.9 below). Indeed, since $N^s$ is always
stochastically larger than $N$, by Remark 2.2 (a) it is always possible to construct
a coupling $(N, N^s)$ such that $D = N^s - N \geq 0$.

(b) However, although we know that a coupling of $N$ and $N^s$ such that $D = N^s - N \geq 0$
is always possible in principle, sometimes one would prefer working with a
feasible and natural coupling which does not have this property. For instance,
this is the case in the situation of Corollary 2.11 below. This is why we have not
restricted ourselves to the case $D \geq 0$ but allow for arbitrary couplings $(N, N^s)$.
We mention that we also have a bound on the Kolmogorov distance between $W$
and a standard normally distributed $Z$ in this more general situation, which is
given by

$$d_K(W, Z) \leq \sum_{j=1}^{7} B_j,$$

where $B_1, B_2, B_4, B_5, B_6$ and $B_7$ are defined in (4.33), (4.38), (11.1 II), (4.50), (4.59)
and (4.64), respectively, and

$$B_3 := \frac{\alpha a^2}{\sigma^2} \sqrt{\mathrm{Var}(E[D \mid N])}.$$

It is this bound that is actually proved in Section 4. Since it is given by a
rather long expression in the most general case, we have decided, however, not
to present it within Theorem 2.5.

(c) We mention that our proof of the Wasserstein bounds given in Theorem 2.5
is only roughly five pages long and is not at all technical, but rather makes use of
probabilistic ideas and concepts. The extended length of our derivation is simply
due to our ambition to present Kolmogorov bounds as well which, as usual within
Stein's method, demand much more technicality.


The next theorem treats the special case of centered summands.

Theorem 2.7. Suppose that Assumption 2.4 holds with $a = E[X_1] = 0$, let $W$ be
given by (2.4) and let $Z$ have the standard normal distribution. Then,


and 


wZ) < + (rf^+4) +l] J^ + ( l v - 2 + 2] jP_ 


d w (W, Z)<T + XT 
a c 6 \ a 


a 


c 3 a 


+ P(N = 0)+ ( ^ + 


7 




\J E[1{n>i}N . 


Remark 2.8. (a) The proof will show that Theorem 2.7 holds as long as
$E[N^2] < \infty$. Thus, Assumption 2.4 could be slightly relaxed in this case.

(b) Theorem 2.7 is not a direct consequence of Theorem 2.5 as it is stated above.
Actually, instead of Theorem 2.5 we could state a result which would reduce to
Theorem 2.7 if $a = 0$, but the resulting bounds would look more cumbersome in
the general case. Also, they would be of the same order as the bounds presented
in Theorem 2.5 in the case that $a \neq 0$. This is why we have refrained from
presenting these bounds in the general case but have chosen to prove Theorem
2.5 and Theorem 2.7 in parallel. Note that, if $a \neq 0$, then a necessary condition
for our bounds to imply the CLT is that

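$$\frac{\alpha}{\sigma^3} E[D^2] = o(1) \quad \text{and} \quad \frac{\alpha}{\sigma^2} \sqrt{\mathrm{Var}(E[D \mid N])} = o(1). \tag{2.5}$$

This should be compared to the conditions which imply asymptotic normality
for $N$ by size-bias couplings given in [GR96], namely

$$\frac{\alpha}{\gamma^3} E[D^2] = o(1) \quad \text{and} \quad \frac{\alpha}{\gamma^2} \sqrt{\mathrm{Var}(E[D \mid N])} = o(1). \tag{2.6}$$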

If (2.6) holds, then from [GR96] we know that $N$ is asymptotically normal and,
as was shown within the proof of Lemma 1 in [Rob48], this implies that $\gamma = o(\alpha)$.
Since, if $a \neq 0$, (2.6) implies (2.5), we can conclude from Theorems 2.5 and 2.7
that $W$ is asymptotically normal. In a nutshell, if the bounds from [GR96] on
the distance to normality of $N$ tend to zero, then so do our bounds and, hence,
yield the CLT for $W$. However, the validity of (2.6) is neither necessary for (2.5)
to hold nor for our bounds to imply asymptotic normality of $W$ (see Remark

1233 (b) below!. 

(c) For distribution functions $F$ and $G$ on $\mathbb{R}$ and $1 \leq p < \infty$, one defines their
$L^p$-distance by

$$\|F - G\|_p := \Bigl(\int_{\mathbb{R}} |F(x) - G(x)|^p \, dx\Bigr)^{1/p}.$$

It is known (see [Dud02]) that $\|F - G\|_1$ coincides with the Wasserstein distance
of the corresponding distributions $\mu$ and $\nu$, say. By Hölder's inequality, for
$1 \leq p < \infty$, we have

$$\|F - G\|_p \leq d_K(\mu, \nu)^{\frac{p-1}{p}} \cdot d_W(\mu, \nu)^{\frac{1}{p}}.$$

Thus, our results immediately yield bounds on the $L^p$-distances of $\mathcal{L}(W)$ and
$N(0, 1)$.

(d) It would be possible to drop the assumption that the summands be identically
distributed. For reasons of clarity of the presentation, we have, however, decided
to stick to the i.i.d. setting. See also the discussion of possible generalizations
before the proof of Lemma 2.1 at the end of this section.
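The interpolation inequality from part (c) of the remark can be illustrated numerically. The sketch below assumes two hypothetical three-point distributions and uses floating-point arithmetic; the helper names are ours. For step distribution functions, $\int |F - G|^p \, dx$ reduces to a finite sum over the gaps between consecutive atoms, and the case $p = 1$ gives the Wasserstein distance.

```python
def cdf_gaps(p, q):
    """List of (|F_p - F_q|, interval length) over gaps between consecutive atoms."""
    support = sorted(set(p) | set(q))
    fp = fq = 0.0
    gaps = []
    for left, right in zip(support, support[1:]):
        fp += p.get(left, 0.0)
        fq += q.get(left, 0.0)
        gaps.append((abs(fp - fq), right - left))
    return gaps


def lp_distance(p, q, power):
    """L^p distance between the distribution functions of p and q."""
    return sum(gap**power * length for gap, length in cdf_gaps(p, q)) ** (1.0 / power)


def kolmogorov(p, q):
    """Kolmogorov distance; both cdfs equal 1 at the last atom, so the gaps suffice."""
    return max(gap for gap, _ in cdf_gaps(p, q))


# Hypothetical example laws (assumptions for illustration only).
p_law = {0: 0.2, 1: 0.5, 3: 0.3}
q_law = {0: 0.1, 2: 0.6, 3: 0.3}

d_k = kolmogorov(p_law, q_law)
d_w = lp_distance(p_law, q_law, 1)

# Right-hand side of the interpolation inequality for a few exponents.
bounds = {
    power: d_k ** ((power - 1) / power) * d_w ** (1 / power)
    for power in (1, 1.5, 2, 3, 10)
}
```

For each exponent in `bounds`, `lp_distance(p_law, q_law, power)` does not exceed the stored value (up to rounding), with equality at `power = 1`.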


Corollary 2.9. Suppose that Assumption 2.4 holds, let $W$ be given by (2.4) and let
$Z$ have the standard normal distribution. Furthermore, assume that the distribution
of the index $N$ is infinitely divisible. Then, we have


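$$d_W(W, Z) \leq \frac{2 c^2 b \gamma^2 + 3 \alpha d^3}{\sigma^3} + \frac{|a| b^2 \bigl(\delta^3 \alpha + \gamma^4 - \beta^4 - \gamma^2 \alpha^2\bigr)}{\alpha \sigma^3} \quad \text{and}$$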


djc(W, Z) < 


d 3 a(3\/27r + 4) ^ c 3 a + ^ + \ V®d 3 + c 2 a 


8 cr 3 


<7 




C<7 


<7- 


-P(N = 0) 


i\b 2 (5 3 a + y 4 — f3 4 — 7 2 a 2 ) f \Z2 tt 1 

77 \1T + 2 


aa J 


+ + 7 4 ~ P 4 ~ 7 2 « 2 ( ^^J )bc2 + VP(N = 0 )^) 

+ E[1 {n > 1} N~ 1/2 ] 


a\b 2 (6 3 a + y 4 — f3 4 — y 2 ct 2 ) "f 2 d 3 \a\b d 3 a 7 2 bc 


cacr 


! a/27T 


ecu 


a- 


’‘y/Zrr 


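Proof. By Remark 2.2 (b) we can choose $D \geq 0$ independent of $N$ such that $N^s =
N + D$ has the $N$-size biased distribution. Thus, by independence and (2.1) we obtain

$$\mathrm{Var}(D) = \mathrm{Var}(N^s) - \mathrm{Var}(N) = E\bigl[(N^s)^2\bigr] - E[N^s]^2 - \gamma^2
= \frac{E[N^3]}{E[N]} - \Bigl(\frac{E[N^2]}{E[N]}\Bigr)^2 - \gamma^2
= \frac{\delta^3}{\alpha} - \frac{\beta^4}{\alpha^2} - \gamma^2.$$

Since $E[D] = E[N^s] - E[N] = \gamma^2 / \alpha$, this gives

$$E[D^2] = \mathrm{Var}(D) + E[D]^2 = \frac{\delta^3}{\alpha} + \frac{\gamma^4 - \beta^4}{\alpha^2} - \gamma^2.$$

Also,

$$\mathrm{Var}(E[D \mid N]) = \mathrm{Var}(E[D]) = 0 \quad \text{and} \quad E\bigl[(E[D^2 \mid N])^2\bigr] = E[D^2]^2$$

in this case. Now, the claim follows from Theorem 2.5.

□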
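The formula for $E[D^2]$ derived in the proof can be sanity-checked on two infinitely divisible index distributions for which $D$ is known explicitly. The sketch below rests on standard facts that are assumptions relative to the text above: for $N \sim \mathrm{Poisson}(\lambda)$ one has $D = 1$, and for a geometric law on $\{0, 1, \dots\}$ with mean $m$ the size-biased variable is $N + N' + 1$ with $N'$ an independent copy of $N$, so that $E[D^2] = \mathrm{Var}(N') + (m + 1)^2$. The function name is hypothetical.

```python
from fractions import Fraction


def second_moment_of_difference(alpha, beta2, delta3):
    """E[D^2] = delta^3/alpha + (gamma^4 - beta^4)/alpha^2 - gamma^2.

    alpha, beta2, delta3 are the first three moments of N, and
    gamma^2 = beta^2 - alpha^2 is its variance.
    """
    gamma2 = beta2 - alpha**2
    return delta3 / alpha + (gamma2**2 - beta2**2) / alpha**2 - gamma2


# Poisson(lam): E[N] = lam, E[N^2] = lam + lam^2, E[N^3] = lam^3 + 3 lam^2 + lam.
lam = Fraction(7, 3)
poisson_value = second_moment_of_difference(lam, lam + lam**2, lam**3 + 3 * lam**2 + lam)

# Geometric on {0, 1, ...} with mean m:
# E[N^2] = 2 m^2 + m, E[N^3] = 6 m^3 + 6 m^2 + m, Var(N) = m^2 + m.
m = Fraction(3, 2)
geometric_value = second_moment_of_difference(m, 2 * m**2 + m, 6 * m**3 + 6 * m**2 + m)
expected_geometric = (m**2 + m) + (m + 1) ** 2  # Var(N') + (E[N'] + 1)^2
```

Both evaluations reproduce the directly computed second moments: `poisson_value` equals 1 and `geometric_value` equals `expected_geometric`.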


In the case that $N$ is constant, the results from Theorem 2.5 reduce to the known
optimal convergence rates for sums of i.i.d. random variables with finite third
moment, albeit with non-optimal constants (see e.g. [She11] and [Gol10] for comparison).


Corollary 2.10. Suppose that Assumption 2.4 holds, let $W$ be given by (2.4) and
let $Z$ have the standard normal distribution. Also, assume that the index $N$ is a
positive constant. Then,


d w (W, Z ) < 
d K {W,Z) < 



Proof. In this case, we can choose $N^s = N$, yielding $D = 0$, and the result follows
from Theorem 2.5.

□


Another typical situation in which the distribution of $W$ may be well approximated by
the normal is when the index $N$ is itself a sum of many i.i.d. variables. Our results yield
very explicit convergence rates in this special case. This will be exemplified for the
Wasserstein distance by the next corollary. Using the bound presented in Remark 2.6
(b), one would get a bound on the Kolmogorov distance which is more complicated
but of the same order of magnitude. A different way to prove bounds for the CLT
by Stein's method in this special situation is presented in Theorem 10.6 of [CGS11].
Their method relies on a general bound for the error of normal approximation to the
distribution of a non-linear statistic of independent random variables which can be
written as a linear statistic plus a small remainder term, as well as on truncation and
conditioning on $N$ in order to apply the classical Berry-Esseen theorem. Though our
method also makes use of conditioning on $N$, it is more directly tied to random sums
and also relies on (variations of) classical couplings in Stein's method (see the proof
in Section 4 for details).


Corollary 2.11. Suppose that Assumption 2.4 holds, let $W$ be given by (2.4) and
let $Z$ have the standard normal distribution. Additionally, assume that the
distribution of the index $N$ is such that $N \stackrel{\mathcal{D}}{=} N_1 + \dots + N_n$, where $n \in \mathbb{N}$ and $N_1, \dots, N_n$ are
i.i.d. nonnegative random variables such that $E[N_1^3] < \infty$. Then, using the notation

$$\alpha_1 := E[N_1], \quad \beta_1^2 := E[N_1^2], \quad \gamma_1^2 := \mathrm{Var}(N_1), \quad \delta_1^3 := E[N_1^3] \quad \text{and} \quad \sigma_1^2 := c^2 \alpha_1 + a^2 \gamma_1^2,$$

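we have

$$d_W(W, Z) \leq \frac{1}{\sqrt{n}} \Biggl(\frac{2 c^2 b \gamma_1^2}{\sigma_1^3} + \frac{3 \alpha_1 d^3}{\sigma_1^3} + \sqrt{\frac{2}{\pi}} \, \frac{\alpha_1 a^2 \gamma_1}{\sigma_1^2} + \frac{2 \alpha_1 (a^2 b + |a| b^2)}{\sigma_1^3} \Bigl(\frac{\delta_1^3}{\alpha_1} - \beta_1^2\Bigr)\Biggr).$$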

Proof. From [GR96] (see also [CGS11]) it is known that, letting $N_1^s$ be independent
of $N_1, \dots, N_n$ and have the $N_1$-size biased distribution, a random variable with the
$N$-size biased distribution is given by

$$N^s := N_1^s + \sum_{j=2}^{n} N_j, \quad \text{yielding } D = N_1^s - N_1.$$

Thus, by independence and since $N_1, \dots, N_n$ are i.i.d., we have

$$E[D \mid N] = E[N_1^s] - \frac{1}{n} N$$

and, hence,

$$\mathrm{Var}(E[D \mid N]) = \frac{\mathrm{Var}(N)}{n^2} = \frac{\gamma_1^2}{n}.$$

Clearly, we have

$$\alpha = n \alpha_1, \quad \gamma^2 = n \gamma_1^2 \quad \text{and} \quad \sigma^2 = n \sigma_1^2.$$

Also, using independence and (2.1),

$$E[D^2] = E\bigl[N_1^2 - 2 N_1 N_1^s + (N_1^s)^2\bigr] = \beta_1^2 - 2 \alpha_1 E[N_1^s] + E\bigl[(N_1^s)^2\bigr]
= \beta_1^2 - 2 \alpha_1 \frac{\beta_1^2}{\alpha_1} + \frac{\delta_1^3}{\alpha_1} = \frac{\delta_1^3}{\alpha_1} - \beta_1^2.$$

Thus, the bound follows from Theorem 2.5.

□


Very prominent examples of random sums which are known to be asymptotically
normal are Poisson and Binomial random sums. The respective bounds, which follow
from our abstract findings, are presented in the next two corollaries.


Corollary 2.12. Suppose that Assumption 2.4 holds, let $W$ be given by (2.4) and
let $Z$ have the standard normal distribution. Assume further that $N \sim \mathrm{Poisson}(\lambda)$
has the Poisson distribution with parameter $\lambda > 0$. Then,

$$d_W(W, Z) \leq \frac{1}{\sqrt{\lambda}} \Bigl(\frac{2 c^2}{b^2} + \frac{3 d^3}{b^3} + \frac{|a|}{b}\Bigr) \quad \text{and}$$

. If \/~2 7T (3\/27r + 4)d 3 c 3 (7 r \ d 3 

MW ,Z)<^— + l P- - 3 + - + (-^ +3 )_ 


|a|(\/27r + 4 + 8 d 3 ) \a 


8 b 


c \ c \d\ _\/o 

H- -j= + -— -j= ) + 73-6 H— —e 

cs/Tk b\^2n J b 2 b 


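Proof. In this case, by Example 2.3 (a), we can choose $D = 1$, yielding that

\[ E[D^2] = 1 \quad\text{and}\quad \mathrm{Var}(E[D \mid N]) = 0. \]

Note that

\[ E[1_{\{N \ge 1\}} N^{-1/2}] \le \sqrt{E[1_{\{N \ge 1\}} N^{-1}]} \]

by Jensen's inequality. Also, using $k + 1 \le 2k$ for all $k \in \mathbb{N}$, we can bound

\[ E[1_{\{N \ge 1\}} N^{-1}] = e^{-\lambda} \sum_{k=1}^\infty \frac{\lambda^k}{k\,k!} \le 2e^{-\lambda} \sum_{k=1}^\infty \frac{\lambda^k}{(k+1)\,k!} = \frac{2}{\lambda} e^{-\lambda} \sum_{k=1}^\infty \frac{\lambda^{k+1}}{(k+1)!} = \frac{2}{\lambda} e^{-\lambda} \sum_{l=2}^\infty \frac{\lambda^l}{l!} \le \frac{2}{\lambda}. \]

Hence,

\[ E[1_{\{N \ge 1\}} N^{-1/2}] \le \frac{\sqrt{2}}{\sqrt{\lambda}}. \]

Noting that

\[ \alpha = \gamma^2 = \lambda \quad\text{and}\quad \sigma^2 = \lambda(a^2 + c^2) = \lambda b^2, \]

the result follows from Theorem 2.5.

□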
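The series estimate for the Poisson index can be checked numerically. The following Python sketch approximates $E[1_{\{N \ge 1\}} N^{-1}]$ by a truncated series and compares it with $2/\lambda$; the function name `poisson_inverse_moment` and the truncation level are illustrative assumptions and not part of the paper.

```python
import math


def poisson_inverse_moment(lam: float, terms: int = 400) -> float:
    """Approximate E[1{N>=1}/N] for N ~ Poisson(lam) by truncating the series."""
    total = 0.0
    pmf = math.exp(-lam)  # P(N = 0)
    for k in range(1, terms + 1):
        pmf *= lam / k  # P(N = k) obtained from P(N = k - 1)
        total += pmf / k
    return total


for lam in (0.5, 1.0, 5.0, 50.0):
    value = poisson_inverse_moment(lam)
    # The proof claims value <= 2 / lam, hence E[1{N>=1} N^{-1/2}] <= sqrt(2 / lam).
    print(lam, value, 2.0 / lam, value <= 2.0 / lam)
```

Such a check only illustrates the inequality for the chosen parameters; it does not replace the analytic argument.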
REMARK 2.13. The Berry-Esseen bound presented in Corollary 2.12 is of the same order in $\lambda$ as the bound given in [KS12], which seems to be the best currently available, but has a worse constant. However, it should be mentioned that the bound in [KS12] was obtained using special properties of the Poisson distribution and does not seem likely to be easily transferable to other distributions of $N$.


Corollary 2.14. Suppose that Assumption 2.4 holds, let $W$ be given by (2.4) and let $Z$ have the standard normal distribution. Furthermore, assume that $N \sim \mathrm{Bin}(n,p)$ has the Binomial distribution with parameters $n \in \mathbb{N}$ and $p \in (0,1]$. Then,

\[ d_W(W,Z) \le \frac{1}{\sqrt{np}\,(b^2 - pa^2)^{3/2}} \Bigl( (2c^2 b + |a| b^2)(1 - p) + 3d^3 + \sqrt{\frac{2}{\pi}}\, a^2 p \sqrt{b^2 - pa^2}\,\sqrt{1 - p} \Bigr) \quad\text{and} \]

$d_K(W,Z) \le$


y/np(b 2 — pa 2 ) 


3/2 


3 (\f2ir + A)bc 2 ^l — p (3y/2n + 4) d 3 

Q _|_ ' —J— — 


|a| 6 2 i/l — p |a|£> 2 v / 27r(l — p) 


8 


s r]3 

V2 + 2 J — + yjl - p(a 2 p + y/2\a\bd 3 ) 


yjrvp{b 2 — pa 2 ) \ \2 

a/2(1 — p)b{2b 2 - a 2 ) 


+ 


zsPZk 


cr . \a\b 


b 2 — pa 2 


(1 -P) n + 


b 2 — pa 2 


(1 ~P) 


n+1 


REMARK 2.15. Bounds for binomial random sums have also been derived in [Sun14] using a technique developed in [Tih80]. Our bounds are of the same order $(np)^{-1/2}$ of magnitude.


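Proof of Corollary 2.14. Here, we clearly have

\[ \alpha = np, \quad \gamma^2 = np(1 - p) \quad\text{and}\quad \sigma^2 = np\bigl(a^2(1 - p) + c^2\bigr). \]

Also, using the same coupling as in Example 2.3 (b) we have $D \sim \mathrm{Bernoulli}(1 - p)$,

\[ E[D^2] = E[D] = 1 - p \quad\text{and}\quad E[D \mid N] = 1 - \frac{N}{n}. \]

This yields

\[ \mathrm{Var}(E[D \mid N]) = \frac{1}{n^2}\mathrm{Var}(N) = \frac{p(1 - p)}{n}. \]

We have $D^2 = D$ and, by Cauchy-Schwarz,

\[ E[D 1_{\{N \ge 1\}} N^{-1/2}] \le \sqrt{E[D^2]}\,\sqrt{E[1_{\{N \ge 1\}} N^{-1}]} = \sqrt{1 - p}\,\sqrt{E[1_{\{N \ge 1\}} N^{-1}]}. \]

Using

\[ \frac{1}{k}\binom{n}{k} = \frac{k+1}{k(n+1)}\binom{n+1}{k+1} \le \frac{2}{n+1}\binom{n+1}{k+1} \le \frac{2}{n}\binom{n+1}{k+1}, \quad 1 \le k \le n, \]

we have

\[ E[1_{\{N \ge 1\}} N^{-1}] = \sum_{k=1}^n \frac{1}{k}\binom{n}{k} p^k (1 - p)^{n-k} \le \frac{2}{n} \sum_{k=1}^n \binom{n+1}{k+1} p^k (1 - p)^{n-k} = \frac{2}{np} \sum_{l=2}^{n+1} \binom{n+1}{l} p^l (1 - p)^{n+1-l} \le \frac{2}{np}. \]

Thus,

\[ E[D 1_{\{N \ge 1\}} N^{-1/2}] \le \frac{\sqrt{2}\sqrt{1 - p}}{\sqrt{np}} \quad\text{and}\quad E[1_{\{N \ge 1\}} N^{-1/2}] \le \frac{\sqrt{2}}{\sqrt{np}}. \]

Also, we can bound

\[ E\bigl[(E[D^2 \mid N])^2\bigr] \le E[D^4] = E[D] = 1 - p. \]

Now, using $a^2 + c^2 = b^2$, the claim follows from Theorem 2.5.

□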
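The binomial analogue of the inverse moment estimate can be illustrated in the same way. The sketch below evaluates $E[1_{\{N \ge 1\}} N^{-1}]$ directly from the binomial probability mass function and compares it with $2/((n+1)p)$ and the slightly weaker $2/(np)$ used in the proof; the helper name `binomial_inverse_moment` and the parameter choices are assumptions made for illustration only.

```python
import math


def binomial_inverse_moment(n: int, p: float) -> float:
    """E[1{N>=1}/N] for N ~ Bin(n, p), summed from the pmf in floating point."""
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k) / k
        for k in range(1, n + 1)
    )


for n, p in ((1, 1.0), (5, 0.3), (50, 0.1), (200, 0.5)):
    value = binomial_inverse_moment(n, p)
    # Columns: n, p, inverse moment, sharper bound, bound used in the proof.
    print(n, p, value, 2.0 / ((n + 1) * p), 2.0 / (n * p))
```

The case $n = 1$, $p = 1$ attains the sharper bound, which reflects that $k + 1 \le 2k$ is an equality at $k = 1$.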


COROLLARY 2.16. Suppose that Assumption 2.4 holds, let $W$ be given by (2.4) and let $Z$ have the standard normal distribution. Assume further that $N \sim \mathrm{Hyp}(n; r, s)$ has the Hypergeometric distribution with parameters $n, r, s \in \mathbb{N}$ such that $n \le \min\{r, s\}$. Then,


d w {W,Z) < 


nr 


V 2 f 2b s(r + s — n) ^ 3d 3 | |a|6 2 s(r + s — n) 

(r + s ) 2 




+ K 


r + sJ \ c (r + s ) 2 
a 2 / 2 / min{r, s} 


+ + 


1/2 


and 


c 2 V 7 T \ n(r + s) 

WZ)<(—) ~ 1/2 [1 + + ■ 4 > 6 +/ + s - ") Y 1/2 

vr + s/ 4c V (r + s ) 2 / 


(3\f2r: 9 r 5\d 3 f VZk 

(— + +a)? + (— +1 


|a| 6 2 s(r + s — n) 
c 3 (r + s ) 2 


+ ^|a|6 2 c 3 \/27r + 

(«) 


|a|M 3 


b \ f 2 s(r + s — n) \ 
cy/2n) V (r + s) 2 ) 


1 / 2 ' 


+ 


n | A min{r, s} \ 1/2 | \a\b ( (s) n s(r + s - n) 


1/2 


(r + s)n c 2 \ n(r + s) J c 2 \ (r + s) n (r + s ) 2 

where $K$ is a numerical constant and $(m)_n = m(m - 1) \cdot \ldots \cdot (m - n + 1)$ denotes the lower factorial.

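Proof. In this case, we clearly have

\[ \alpha = \frac{nr}{r+s}, \quad \gamma^2 = \frac{nr}{r+s} \cdot \frac{s}{r+s} \cdot \frac{r+s-n}{r+s-1} = \frac{nr}{r+s} \cdot \frac{s(r+s-n)}{(r+s)_2} \]

and

\[ \sigma^2 = \frac{nr}{r+s} \Bigl( c^2 + a^2 \frac{s}{r+s} \cdot \frac{r+s-n}{r+s-1} \Bigr) = \frac{nr}{r+s} \Bigl( c^2 + a^2 \frac{s(r+s-n)}{(r+s)_2} \Bigr). \]

Hence,

\[ c^2 \frac{nr}{r+s} \le \sigma^2 \le \frac{nr}{r+s} \Bigl( c^2 + a^2 \frac{s}{r+s} \Bigr) = \frac{nr}{r+s} \Bigl( b^2 - a^2 \frac{r}{r+s} \Bigr). \]

We use the coupling constructed in Example 2.3 (c) but write $N$ for $X$ and $N^s$ for $X^s$, here. Recall that we have

\[ D = N^s - N = 1_{\{X_1 = 0\}} \Bigl( 1 - \sum_{j=2}^n Y_j \Bigr) \ge 0 \quad\text{and}\quad D = D^2. \]

Furthermore, we know that

\[ E[D] = E[D^2] = d_W(N, N^s) = \frac{\mathrm{Var}(N)}{E[N]} = \frac{s(r+s-n)}{(r+s)_2}. \]

Elementary combinatorics yield

\[ E[Y_j \mid X_1, \ldots, X_n] = r^{-1} 1_{\{X_j = 1\}}. \]

Thus,

\[ E[D \mid X_1, \ldots, X_n] = 1_{\{X_1 = 0\}} - \frac{1}{r} 1_{\{X_1 = 0\}} \sum_{j=2}^n 1_{\{X_j = 1\}} = 1_{\{X_1 = 0\}} \Bigl( 1 - \frac{N}{r} \Bigr) \]

and

\[ E[D \mid N] = \Bigl( 1 - \frac{N}{r} \Bigr) P(X_1 = 0 \mid N) = \Bigl( 1 - \frac{N}{r} \Bigr) \frac{n - N}{n} = \frac{(r - N)(n - N)}{nr} = \Bigl( 1 - \frac{N}{n} \Bigr) \Bigl( 1 - \frac{N}{r} \Bigr). \]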

Using a computer algebra system, one may check that 
Var(i7 [D | N ~\) = (nrs — n 3 rs — r 2 s + 5 n 2 r 2 s + 2 n 3 r 2 s — 8 nr 3 s — 8n 2 r 3 s + 2 nrs 5 

— n 3 r 3 s + 4 r 4 s + 10nr 4 s + 3n 2 r 4 s — 4r 5 s — 3 nr 5 s + r 6 s + ns 2 

— n 3 s 2 — 2 rs 2 + 4n 2 rs 2 — 2 n 3 rs 2 — 14 nr 2 s 2 — 4 r n 2, r 2 s 2 + n 3 r 2 s 2 
+ 12r 3 s 2 + 20nr 3 s 2 + 2 n 2 r 3 s 2 — 14r 4 s 2 — 7 nr A s 2 + 4r 5 s 2 — s 3 

— n 2 s 3 + 2 n 3 s 3 — bnrs 3 + 4n 2 rs 3 + n 3 rs 3 + 13r 2 s 3 + 8 nr 2 s 3 

— 4 n 2 r 2 s 3 — 18r 3 s 3 — 3 nr 3 s 3 + 6r 4 s 3 + ns 4 — n 3 s 4 + 6 rs 4 — 4nrs 4 

— 2 n 2 rs 4 — 10r 2 s 4 + 3 nr 2 s 4 + 4r 3 s 4 + s 5 — 2ns 5 + n 2 s 5 — 2rs 5 

+ r 2 s 5 j ^nr(r + s) 2 (r + s — l) 2 (r + s — 2)(r + s — 3)j 
(2.7) =:e(n,r,s). 

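One can check that under the assumption $n \le \min\{r, s\}$ always

\[ \varepsilon(n,r,s) = O\Bigl( \frac{\min\{r, s\}}{n(r+s)} \Bigr). \]

Hence, there is a numerical constant $K$ such that

\[ \sqrt{\varepsilon(n,r,s)} \le K \Bigl( \frac{\min\{r, s\}}{n(r+s)} \Bigr)^{1/2}. \]

Also, by the conditional version of Jensen's inequality

\[ E\bigl[(E[D^2 \mid N])^2\bigr] \le E[D^4] = E[D] = \frac{s(r+s-n)}{(r+s)_2}. \]

Using

\[ E[1_{\{N \ge 1\}} N^{-1}] = \binom{r+s}{n}^{-1} \sum_{k=1}^n \frac{1}{k} \binom{r}{k} \binom{s}{n-k} \le \frac{2}{r+1} \binom{r+s}{n}^{-1} \sum_{l=2}^{n+1} \binom{r+1}{l} \binom{s}{n+1-l} \]
\[ \le \frac{2}{r+1} \binom{r+s}{n}^{-1} \binom{r+1+s}{n+1} = \frac{2(r+s+1)}{(n+1)(r+1)} \le 2\,\frac{r+s}{nr} \]

we get

\[ E[D 1_{\{N \ge 1\}} N^{-1/2}] \le \sqrt{E[D^2]}\,\sqrt{E[1_{\{N \ge 1\}} N^{-1}]} \le \Bigl( 2\,\frac{s(r+s-n)}{(r+s)(r+s-1)} \cdot \frac{r+s}{nr} \Bigr)^{1/2} = \sqrt{2} \Bigl( \frac{s}{nr} \cdot \frac{r+s-n}{r+s-1} \Bigr)^{1/2} \]

and

\[ E[1_{\{N \ge 1\}} N^{-1/2}] \le \sqrt{E[1_{\{N \ge 1\}} N^{-1}]} \le \sqrt{2} \Bigl( \frac{r+s}{nr} \Bigr)^{1/2}. \]

Finally, we have

\[ P(N = 0) = \frac{\binom{s}{n}}{\binom{r+s}{n}} = \frac{s(s-1) \cdot \ldots \cdot (s-n+1)}{(r+s)(r+s-1) \cdot \ldots \cdot (r+s-n+1)} = \frac{(s)_n}{(r+s)_n}. \]

Thus, the result follows from Theorem 2.5.

□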


Remark 2.17. (a) From the above proof we see that the numerical constant $K$ appearing in the bounds of Corollary 2.16 could in principle be computed explicitly. Also, as always

\[ \frac{\min\{r, s\}}{n(r+s)} \le \frac{r+s}{nr} = \frac{1}{E[N]}, \]

we conclude that the bounds are of order $E[N]^{-1/2}$.

(b) One typical situation, in which a CLT for Hypergeometric random sums holds, is when $N$, itself, is asymptotically normal. Using the same coupling $(N, N^s)$ as in the above proof and the results from [GR96], one obtains that under the condition

\[ \frac{\max\{r, s\}}{n \min\{r, s\}} \longrightarrow 0 \tag{2.8} \]

the index $N$ is asymptotically normal. This condition is stricter than the condition

\[ E[N]^{-1} = \frac{r+s}{nr} \longrightarrow 0, \tag{2.9} \]

which implies the random sums CLT. For instance, choosing

\[ r \propto n^{1+\varepsilon} \quad\text{and}\quad s \propto n^{1+\kappa} \]

with $\varepsilon, \kappa > 0$, then (2.8) holds, if and only if $|\varepsilon - \kappa| < 1$, whereas (2.9) is equivalent to $\kappa - \varepsilon < 1$ in this case.
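The difference between the two growth conditions in part (b) can be made concrete with a small computation. The following Python sketch evaluates the ratios from (2.8) and (2.9) for $r = n^{1+\varepsilon}$ and $s = n^{1+\kappa}$ with proportionality constants set to one; the function names and the sample exponents are hypothetical choices, and $r$, $s$ are treated as real numbers since only the asymptotic behaviour matters here.

```python
def ratio_2_8(n: float, eps: float, kappa: float) -> float:
    """max{r, s} / (n * min{r, s}) for r = n**(1 + eps), s = n**(1 + kappa)."""
    r, s = n ** (1.0 + eps), n ** (1.0 + kappa)
    return max(r, s) / (n * min(r, s))


def ratio_2_9(n: float, eps: float, kappa: float) -> float:
    """(r + s) / (n * r) = 1 / E[N] for the same choice of r and s."""
    r, s = n ** (1.0 + eps), n ** (1.0 + kappa)
    return (r + s) / (n * r)


# eps - kappa = 1.3 > 1, so (2.8) fails, while kappa - eps < 1, so (2.9) holds.
eps, kappa = 1.5, 0.2
for n in (1e2, 1e4, 1e6):
    print(n, ratio_2_8(n, eps, kappa), ratio_2_9(n, eps, kappa))
```

In this example the first ratio grows like $n^{0.3}$ whereas the second decays like $1/n$, in line with the statement that (2.8) is the stricter condition.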
Before we end this section by giving the proof of Lemma 2.1 we would like to mention in what respects the results in this article could be generalized. Firstly, it would be possible to dispense with the assumption of independence among the summands $X_1, X_2, \ldots$. Of course, the terms appearing in the bounds would look more complicated, but the only essential change would be the emergence of the additional error term

\[ E_3 := \frac{C}{\alpha\sigma} E|\alpha A_N - \mu N|, \]

where

\[ A_N := \sum_{j=1}^N E[X_j] \quad\text{and}\quad \mu = E[A_N] \]

and where $C$ is an explicit constant depending on the probabilistic distance chosen. Note that $E_3 = 0$ if the summands are either i.i.d. or centered.

Secondly, it would be possible in principle to allow for some dependence among the summands $X_1, X_2, \ldots$. Indeed, an inspection of the proof in Section 4 reveals that this dependence should be such that bounds on the normal approximation exist for the non-random partial sums and such that suitable couplings with the non-zero biased distribution (see Section 3) of those partial sums are available. The latter, however, have not been constructed yet in great generality, although [GR97] gives a construction for summands forming a simple random sampling in the zero bias case.

It would be much more difficult to abandon the assumption about the independence of the summation index and the summands. This can be seen from Equation (4.9) below, in which the second identity would no longer hold, in general, if this independence was no longer valid. Also, one would no longer be able to freely choose the coupling $(N, N^s)$ when specializing to concrete distributions of $N$.


Proof of Lemma 2.1. Let $h$ be a measurable function such that all the expected values in (2.1) exist. By (2.1) we have

\[ E[h(X^s)] - E[h(X)] = \frac{1}{E[X]} E[(X - E[X])h(X)]. \tag{2.10} \]

It is well known that

\[ d_{TV}(X,Y) = \sup_{h \in \mathcal{H}} \bigl| E[h(X)] - E[h(Y)] \bigr|, \tag{2.11} \]

where $\mathcal{H}$ is the class of all measurable functions on $\mathbb{R}$ such that $\|h\|_\infty \le 1/2$. If $\|h\|_\infty \le 1/2$, then

\[ \Bigl| \frac{1}{E[X]} E[(X - E[X])h(X)] \Bigr| \le \frac{E|X - E[X]|}{2E[X]}. \]

Hence, from (2.11) and (2.10) we conclude that

\[ d_{TV}(X, X^s) \le \frac{E|X - E[X]|}{2E[X]}. \]

On the other hand, letting

\[ h(x) := \frac{1}{2} \bigl( 1_{\{x > E[X]\}} - 1_{\{x \le E[X]\}} \bigr) \]

in (2.10) we have $h \in \mathcal{H}$ and obtain

\[ E[h(X^s)] - E[h(X)] = \frac{E|X - E[X]|}{2E[X]}, \]

proving the second equality of (a). Note that, since $X^s$ is stochastically larger than $X$, we have

\[ d_K(X, X^s) = \sup_{t \ge 0} \bigl| P(X^s > t) - P(X > t) \bigr| = \sup_{t \ge 0} \bigl( P(X^s > t) - P(X > t) \bigr) \]
\[ = \sup_{t \ge 0} \bigl( E[g_t(X^s)] - E[g_t(X)] \bigr), \tag{2.12} \]

where $g_t := 1_{(t,\infty)}$.

By (2.10), choosing $t = E[X]$ yields

\[ d_K(X, X^s) \ge \frac{1}{E[X]} E\bigl[(X - E[X]) 1_{\{X > E[X]\}}\bigr]. \tag{2.13} \]

If $0 \le t < E[X]$ we obtain

\[ E\bigl[(X - E[X]) 1_{\{X > t\}}\bigr] = E\bigl[(X - E[X]) 1_{\{t < X \le E[X]\}}\bigr] + E\bigl[(X - E[X]) 1_{\{X > E[X]\}}\bigr] \]
\[ \le E\bigl[(X - E[X]) 1_{\{X > E[X]\}}\bigr]. \tag{2.14} \]

Also, if $t \ge E[X]$, then

\[ E\bigl[(X - E[X]) 1_{\{X > t\}}\bigr] \le E\bigl[(X - E[X]) 1_{\{X > E[X]\}}\bigr]. \tag{2.15} \]

Thus, by (2.10), from (2.12), (2.13), (2.14) and (2.15) we conclude that

\[ d_K(X, X^s) = \frac{1}{E[X]} E\bigl[(X - E[X]) 1_{\{X > E[X]\}}\bigr]. \tag{2.16} \]

Now, the remaining claim of (a) can be easily inferred from (2.16) and from the following two identities:

\[ 0 = E[X - E[X]] = E\bigl[(X - E[X]) 1_{\{X > E[X]\}}\bigr] + E\bigl[(X - E[X]) 1_{\{X \le E[X]\}}\bigr] \]
\[ = E\bigl[|X - E[X]| 1_{\{X > E[X]\}}\bigr] - E\bigl[|X - E[X]| 1_{\{X \le E[X]\}}\bigr] \]

and

\[ E|X - E[X]| = E\bigl[|X - E[X]| 1_{\{X > E[X]\}}\bigr] + E\bigl[|X - E[X]| 1_{\{X \le E[X]\}}\bigr] = 2E\bigl[(X - E[X]) 1_{\{X > E[X]\}}\bigr]. \]

Finally, if $h$ is 1-Lipschitz continuous, then

\[ \bigl| E[(X - E[X])h(X)] \bigr| = \bigl| E\bigl[(X - E[X])\bigl(h(X) - h(E[X])\bigr)\bigr] \bigr| \le \|h'\|_\infty E\bigl[|X - E[X]|^2\bigr] \le \mathrm{Var}(X). \]

On the other hand, the function $h(x) := x - E[X]$ is 1-Lipschitz and

\[ E[(X - E[X])h(X)] = \mathrm{Var}(X). \]

Thus, also (b) is proved.

□
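The identities obtained in this proof can be verified exactly on a small example. The Python sketch below uses rational arithmetic to compute the Kolmogorov, total variation and Wasserstein distances between a finitely supported nonnegative integer-valued $X$ and its size biased version $X^s$, and prints them next to $E|X - E[X]|/(2E[X])$ and $\mathrm{Var}(X)/E[X]$. The example distribution and the helper names `size_biased` and `distances` are assumptions made for illustration.

```python
from fractions import Fraction


def size_biased(pmf):
    """Size-biased pmf k * p_k / E[X] of a pmf on nonnegative integers."""
    mean = sum(k * p for k, p in pmf.items())
    return {k: k * p / mean for k, p in pmf.items()}


def distances(pmf):
    """Exact (d_K, d_TV, d_W) between X and its size-biased version X^s."""
    biased = size_biased(pmf)
    support = sorted(pmf)
    d_tv = sum(abs(pmf[k] - biased[k]) for k in support) / 2
    cdf_x = cdf_s = Fraction(0)
    d_k = d_w = Fraction(0)
    for k, k_next in zip(support, support[1:] + [support[-1]]):
        cdf_x += pmf[k]
        cdf_s += biased[k]
        gap = abs(cdf_x - cdf_s)
        d_k = max(d_k, gap)
        d_w += gap * (k_next - k)  # |F - F^s| is constant on [k, k_next)
    return d_k, d_tv, d_w


pmf = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 4), 5: Fraction(1, 4)}
mean = sum(k * p for k, p in pmf.items())
variance = sum(k * k * p for k, p in pmf.items()) - mean**2
mean_abs_dev = sum(abs(k - mean) * p for k, p in pmf.items())
d_k, d_tv, d_w = distances(pmf)
print(d_k, d_tv, mean_abs_dev / (2 * mean))  # three equal values expected
print(d_w, variance / mean)  # two equal values expected
```

Working with `Fraction` avoids rounding, so the equalities can be tested exactly rather than up to a tolerance.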
3. Elements of Stein's method


In this section we review some well-known and also some recent results about Stein's method of normal approximation. Our general reference for this topic is the book [CGS11]. Throughout, $Z$ will denote a standard normal random variable. Stein's method originated from Stein's seminal observation (see [Ste72]) that a real-valued random variable $X$ has the standard normal distribution, if and only if the identity

\[ E[f'(X)] = E[Xf(X)] \]

holds for each, say, continuously differentiable function $f$ with bounded derivative. For a given random variable $W$, which is supposed to be asymptotically normal, and a Borel-measurable test function $h$ on $\mathbb{R}$ with $E|h(Z)| < \infty$ it was then Stein's idea to solve the Stein equation

\[ f'(x) - xf(x) = h(x) - E[h(Z)] \tag{3.1} \]

and to use properties of the solution $f$ and of $W$ in order to bound the right hand side of

\[ E[h(W)] - E[h(Z)] = E[f'(W) - Wf(W)] \]

rather than bounding the left hand side directly. For $h$ as above, by $f_h$ we denote the standard solution to the Stein equation (3.1) which is given by

\[ f_h(x) = e^{x^2/2} \int_{-\infty}^x \bigl( h(t) - E[h(Z)] \bigr) e^{-t^2/2}\,dt = -e^{x^2/2} \int_x^\infty \bigl( h(t) - E[h(Z)] \bigr) e^{-t^2/2}\,dt. \tag{3.2} \]

Note that, generally, $f_h$ is only differentiable and satisfies (3.1) at the continuity points of $h$. In order to be able to deal with distributions which might have point masses, if $x \in \mathbb{R}$ is a point at which $f_h$ is not differentiable, one defines

\[ f_h'(x) := xf_h(x) + h(x) - E[h(Z)] \tag{3.3} \]

such that, by definition, $f_h$ satisfies (3.1) at each point $x \in \mathbb{R}$. This gives a Borel-measurable version of the derivative of $f_h$ in the Lebesgue sense. Properties of the solutions $f_h$ for various classes of test functions $h$ have been studied. Since we are only interested in the Kolmogorov and Wasserstein distances, we either suppose that $h$ is 1-Lipschitz or that $h = h_z = 1_{(-\infty,z]}$ for some $z \in \mathbb{R}$. In the latter case we write $f_z$ for $f_{h_z}$.

We need the following properties of the solutions $f_h$. If $h$ is 1-Lipschitz, then it is well known (see e.g. [CGS11]) that $f_h$ is continuously differentiable and that both $f_h$ and $f_h'$ are Lipschitz-continuous with

\[ \|f_h'\|_\infty \le \sqrt{\frac{2}{\pi}} \quad\text{and}\quad \|f_h''\|_\infty \le 2. \tag{3.4} \]

Here, for a function $g$ on $\mathbb{R}$, we denote by

\[ \|g'\|_\infty := \sup_{x \ne y} \frac{|g(x) - g(y)|}{|x - y|} \]

its minimum Lipschitz constant. Note that if $g$ is absolutely continuous, then $\|g'\|_\infty$ coincides with the essential supremum norm of the derivative of $g$ in the Lebesgue sense. Hence, the double use of the symbol $\|\cdot\|_\infty$ does not cause any problems. For an absolutely continuous function $g$ on $\mathbb{R}$, a fixed choice of its derivative $g'$ and for $x, y \in \mathbb{R}$ we let

\[ R_g(x,y) := g(x + y) - g(x) - g'(x)y \tag{3.5} \]

denote the remainder term of its first order Taylor expansion around $x$ at the point $x + y$. If $h$ is 1-Lipschitz, then we obtain for all $x, y \in \mathbb{R}$ that

\[ |R_{f_h}(x,y)| = |f_h(x + y) - f_h(x) - f_h'(x)y| \le y^2. \tag{3.6} \]

This follows from (3.4) via

\[ |f_h(x + y) - f_h(x) - f_h'(x)y| = \Bigl| \int_x^{x+y} \bigl( f_h'(t) - f_h'(x) \bigr)\,dt \Bigr| \le \|f_h''\|_\infty \Bigl| \int_x^{x+y} |t - x|\,dt \Bigr| = \frac{y^2}{2} \|f_h''\|_\infty \le y^2. \]


For $h = h_z$ we list the following properties of $f_z$: The function $f_z$ has the representation

\[ f_z(x) = \begin{cases} \dfrac{(1 - \Phi(z))\Phi(x)}{\varphi(x)}, & x \le z, \\[2mm] \dfrac{\Phi(z)(1 - \Phi(x))}{\varphi(x)}, & x > z. \end{cases} \tag{3.7} \]

Here, $\Phi$ denotes the standard normal distribution function and $\varphi := \Phi'$ the corresponding continuous density. It is easy to see from (3.7) that $f_z$ is infinitely often differentiable on $\mathbb{R} \setminus \{z\}$. Furthermore, it is well-known that $f_z$ is Lipschitz-continuous with Lipschitz constant 1 and that it satisfies

\[ 0 < f_z(x) \le f_0(0) = \frac{\sqrt{2\pi}}{4}, \quad x, z \in \mathbb{R}. \]

These properties already easily yield that for all $x, u, v, z \in \mathbb{R}$

\[ \bigl| (x + u)f_z(x + u) - (x + v)f_z(x + v) \bigr| \le \Bigl( |x| + \frac{\sqrt{2\pi}}{4} \Bigr) (|u| + |v|). \tag{3.8} \]

Proofs of the above mentioned classic facts about the functions $f_z$ can again be found in [CGS11], for instance. As $f_z$ is not differentiable at $z$ (the right and left derivatives do exist but are not equal) by the above Convention (3.3) we define

\[ f_z'(z) := zf_z(z) + 1 - \Phi(z) \tag{3.9} \]

such that $f = f_z$ satisfies (3.1) with $h = h_z$ for all $x \in \mathbb{R}$. Furthermore, with this definition, for all $x, z \in \mathbb{R}$ we have

\[ |f_z'(x)| \le 1. \tag{3.10} \]

The following quantitative version of the first order Taylor approximation of $f_z$ has recently been proved by Lachièze-Rey and Peccati [LRP15] and had already been used implicitly in [ET14]. Using (3.9), for all $x, u, z \in \mathbb{R}$ we have

\[ |R_{f_z}(x,u)| = |f_z(x + u) - f_z(x) - f_z'(x)u| \le \frac{u^2}{2} \Bigl( |x| + \frac{\sqrt{2\pi}}{4} \Bigr) + |u| \bigl( 1_{\{x \le z < x+u\}} + 1_{\{x+u \le z < x\}} \bigr) \]
\[ = \frac{u^2}{2} \Bigl( |x| + \frac{\sqrt{2\pi}}{4} \Bigr) + |u| 1_{\{z - (u \vee 0) < x \le z - (u \wedge 0)\}}, \tag{3.11} \]

where, here and elsewhere, we write $x \vee y := \max(x, y)$ and $x \wedge y := \min(x, y)$.
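To make the role of the representation (3.7) more tangible, the following Python sketch evaluates $f_z$ numerically and checks, by central finite differences away from the kink at $x = z$, that it satisfies the Stein equation (3.1) with $h = h_z$. It also prints the value $f_0(0)$ next to $\sqrt{2\pi}/4$. All function names, the step size and the sample points are assumptions chosen for illustration; the code is a sanity check and not part of the original argument.

```python
import math


def Phi(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def phi(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def f_z(x: float, z: float) -> float:
    """Representation (3.7); the upper tail 1 - Phi(t) is computed as Phi(-t)."""
    if x <= z:
        return Phi(-z) * Phi(x) / phi(x)
    return Phi(z) * Phi(-x) / phi(x)


def stein_residual(x: float, z: float, step: float = 1e-5) -> float:
    """f_z'(x) - x f_z(x) - (1{x <= z} - Phi(z)), with f_z' by central differences."""
    derivative = (f_z(x + step, z) - f_z(x - step, z)) / (2.0 * step)
    indicator = 1.0 if x <= z else 0.0
    return derivative - x * f_z(x, z) - (indicator - Phi(z))


for z in (-1.0, 0.0, 0.7):
    for x in (-3.0, -1.5, -0.4, 0.3, 1.2, 2.5):
        print(z, x, f_z(x, z), stein_residual(x, z))
print(f_z(0.0, 0.0), math.sqrt(2.0 * math.pi) / 4.0)
```

The residuals should be close to zero at every sample point, since none of the points coincides with the corresponding $z$.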


For the proof of Theorems 2.5 and 2.7 we need to recall a certain coupling construction, which has been efficiently used in Stein's method of normal approximation: Let $X$ be a real-valued random variable such that $E[X] = 0$ and $0 < E[X^2] < \infty$. In [GR97] it was proved that there exists a unique distribution for a random variable $X^*$ such that for all Lipschitz continuous functions $f$ the identity

\[ E[Xf(X)] = \mathrm{Var}(X) E[f'(X^*)] \tag{3.12} \]

holds true. The distribution of $X^*$ is called the $X$-zero biased distribution and the distributional transformation which maps $\mathcal{L}(X)$ to $\mathcal{L}(X^*)$ is called the zero bias transformation. It can be shown that (3.12) holds for all absolutely continuous functions $f$ on $\mathbb{R}$ such that $E|Xf(X)| < \infty$. From the Stein characterization of the family of normal distributions it is immediate that the fixed points of the zero bias transformation are exactly the centered normal distributions. Thus, if, for a given $X$, the distribution of $X^*$ is close to that of $X$, the distribution of $X$ is approximately a fixed point of this transformation and, hence, should be close to the normal distribution with the same variance as $X$. In [Gol04] this heuristic was made precise by showing the inequality

\[ d_W(X, \sigma Z) \le 2 d_W(X, X^*), \]

where $X$ is a mean zero random variable with $0 < \sigma^2 = E[X^2] = \mathrm{Var}(X) < \infty$, $X^*$ having the $X$-zero biased distribution is defined on the same probability space as $X$ and $Z$ is standard normally distributed. For merely technical reasons we introduce a variant of the zero bias transformation for not necessarily centered random variables. Thus, if $X$ is a real random variable with $0 < E[X^2] < \infty$, we say that a random variable $X^{nz}$ has the $X$-non-zero biased distribution, if for all Lipschitz-continuous functions $f$ it holds that

\[ E[(X - E[X])f(X)] = \mathrm{Var}(X) E[f'(X^{nz})]. \]

Existence and uniqueness of the $X$-non-zero biased distribution immediately follow from Theorem 2.1 of [GR05] (or Theorem 2.1 of [Dob15] by letting $B(x) = x - E[X]$, there). Alternatively, letting $Y := X - E[X]$ and $Y^*$ have the $Y$-zero biased distribution, it is easy to see that $X^{nz} := Y^* + E[X]$ fulfills the requirements for the $X$-non-zero biased distribution. Most of the properties of the zero bias transformation have natural analogs for the non-zero bias transformation, so we do not list them all, here. Since an important part of the proof of our main result relies on the so-called single summand property, however, we state the result for the sake of reference.
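A standard textbook example, not taken from this paper, may help to illustrate the defining identities: if $R$ is uniform on $\{-1, +1\}$, then its zero biased distribution is the uniform distribution on $[-1, 1]$, so that $X = m + R$ has the non-zero biased version $X^{nz} = m + U$ with $U$ uniform on $[-1, 1]$. The following Python sketch compares both sides of the identity for a few test functions; the helper names, the shift $m$ and the quadrature resolution are illustrative assumptions.

```python
import math


def uniform_mean(func, lo: float, hi: float, steps: int = 20000) -> float:
    """Midpoint-rule approximation of E[func(U)] for U uniform on [lo, hi]."""
    width = (hi - lo) / steps
    return sum(func(lo + (i + 0.5) * width) for i in range(steps)) / steps


def lhs(f, mean: float = 0.0) -> float:
    """E[(X - E[X]) f(X)] for X = mean + R with R uniform on {-1, +1}."""
    return 0.5 * (f(mean + 1.0) - f(mean - 1.0))


def rhs(f_prime, mean: float = 0.0) -> float:
    """Var(X) * E[f'(X^nz)] with Var(X) = 1 and X^nz uniform on [mean - 1, mean + 1]."""
    return 1.0 * uniform_mean(f_prime, mean - 1.0, mean + 1.0)


tests = (
    (math.sin, math.cos),
    (lambda x: x**3, lambda x: 3.0 * x**2),
    (math.atan, lambda x: 1.0 / (1.0 + x * x)),
)
for mean in (0.0, 2.0):
    for f, f_prime in tests:
        print(mean, lhs(f, mean), rhs(f_prime, mean))
```

With `mean = 0.0` this is the zero bias identity (3.12); a nonzero `mean` corresponds to the non-zero bias variant introduced above.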
Lemma 3.1 (single summand property). Let $X_1, \ldots, X_n$ be independent random variables such that $0 < E[X_j^2] < \infty$, $j = 1, \ldots, n$. Define $\sigma_j^2 := \mathrm{Var}(X_j)$, $j = 1, \ldots, n$, $S := \sum_{j=1}^n X_j$ and $\sigma^2 := \mathrm{Var}(S) = \sum_{j=1}^n \sigma_j^2$. For each $j = 1, \ldots, n$ let $X_j^{nz}$ have the $X_j$-non-zero biased distribution and be independent of $X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_n$ and let $I \in \{1, \ldots, n\}$ be a random index, independent of all the rest and such that

\[ P(I = j) = \frac{\sigma_j^2}{\sigma^2}, \quad j = 1, \ldots, n. \]

Then, the random variable

\[ S^{nz} := S - X_I + X_I^{nz} = \sum_{i=1}^n 1_{\{I = i\}} \Bigl( \sum_{j \ne i} X_j + X_i^{nz} \Bigr) \]

has the $S$-non-zero biased distribution.

Proof. The proof is either analogous to the proof of Lemma 2.1 in [GR97] or else, the statement could be deduced from this result in the following way: Using the fact that $X^{nz} = Y^* + E[X]$ has the $X$-non-zero biased distribution if and only if $Y^*$ has the $(X - E[X])$-zero biased distribution, we let $Y_j := X_j - E[X_j]$, $Y_j^* := X_j^{nz} - E[X_j]$, $j = 1, \ldots, n$ and $W := \sum_{j=1}^n Y_j = S - E[S]$. Then, from Lemma 2.1 in [GR97] we know that

\[ W^* := W - Y_I + Y_I^* = S - E[S] + \sum_{j=1}^n 1_{\{I = j\}} \bigl( E[X_j] - X_j + X_j^{nz} - E[X_j] \bigr) \]
\[ = S - X_I + X_I^{nz} - E[S] = S^{nz} - E[S] \]

has the $W$-zero biased distribution, implying that $S^{nz}$ has the $S$-non-zero biased distribution.

□


4. Proof of Theorems 2.5 and 2.7

From now on we let $h$ be either 1-Lipschitz or $h = h_z$ for some $z \in \mathbb{R}$ and write $f = f_h$ given by (3.2). Since $f$ is a solution to (3.1), plugging in $W$ and taking expectations yields

\[ E[h(W)] - E[h(Z)] = E[f'(W) - Wf(W)]. \tag{4.1} \]

As usual in Stein's method of normal approximation, the main task is to rewrite the term $E[Wf(W)]$ into a more tractable expression by exploiting the structure of $W$ and using properties of $f$. From (2.4) we have

\[ E[Wf(W)] = \frac{1}{\sigma} E[(S - aN)f(W)] + \frac{a}{\sigma} E[(N - \alpha)f(W)] =: T_1 + T_2. \tag{4.2} \]

For ease of notation, for $n \in \mathbb{Z}_+$ and $M$ any $\mathbb{Z}_+$-valued random variable we let

\[ S_n := \sum_{j=1}^n X_j, \quad W_n := \frac{S_n - \alpha a}{\sigma}, \quad S_M := \sum_{j=1}^M X_j \quad\text{and}\quad W_M := \frac{S_M - \alpha a}{\sigma}, \]

such that, in particular, $S = S_N$ and $W = W_N$. Using the decomposition

\[ E[f'(W)] = \frac{\alpha c^2}{\sigma^2} E[f'(W)] + \frac{a^2\gamma^2}{\sigma^2} E[f'(W)], \]

which is true by virtue of (2.3), from (4.1) and (4.2) we have

\[ E[h(W)] - E[h(Z)] = E[f'(W)] - T_1 - T_2 \]
\[ = E\Bigl[ \frac{c^2\alpha}{\sigma^2} f'(W) - \frac{1}{\sigma}(S - aN)f(W) \Bigr] + E\Bigl[ \frac{a^2\gamma^2}{\sigma^2} f'(W) - \frac{a}{\sigma}(N - \alpha)f(W) \Bigr] \]
\[ =: E_1 + E_2. \tag{4.3} \]


We will bound the terms $E_1$ and $E_2$ separately. Using the independence of $N$ and $X_1, X_2, \ldots$ for $T_1$ we obtain:

\[ T_1 = \frac{1}{\sigma} \sum_{n=0}^\infty P(N = n) E[(S_n - na)f(W_n)] \]
\[ = \frac{1}{\sigma} \sum_{n=0}^\infty P(N = n) E[(S_n - na)g(S_n)], \tag{4.4} \]

where

\[ g(x) := f\Bigl( \frac{x - \alpha a}{\sigma} \Bigr). \]

Thus, if, for each $n \ge 0$, $S_n^{nz}$ has the $S_n$-non-zero biased distribution, from (4.4) we obtain that

\[ T_1 = \frac{1}{\sigma} \sum_{n=0}^\infty P(N = n) \mathrm{Var}(S_n) E[g'(S_n^{nz})] = \frac{c^2}{\sigma^2} \sum_{n=0}^\infty nP(N = n) E\Bigl[ f'\Bigl( \frac{S_n^{nz} - \alpha a}{\sigma} \Bigr) \Bigr]. \]

Note that if we let $M$ be independent of $S_1^{nz}, S_2^{nz}, \ldots$ and have the $N$-size biased distribution, then, this implies that

\[ T_1 = \frac{c^2\alpha}{\sigma^2} E\Bigl[ f'\Bigl( \frac{S_M^{nz} - \alpha a}{\sigma} \Bigr) \Bigr], \tag{4.5} \]

where

\[ S_M^{nz} = \sum_{n=1}^\infty 1_{\{M = n\}} S_n^{nz}. \]

We use Lemma 3.1 for the construction of the variables $S_n^{nz}$, $n \in \mathbb{N}$. Note, however, that by the i.i.d. property of the $X_j$ we actually do not need the mixing index $I$, here. Hence, we construct independent random variables

\[ (N, M), X_1, X_2, \ldots \quad\text{and}\quad Y \]

such that $M$ has the $N$-size biased distribution and such that $Y$ has the $X_1$-non-zero biased distribution. Then, for all $n \in \mathbb{N}$

\[ S_n^{nz} := S_n - X_1 + Y \]

has the $S_n$-non-zero biased distribution and we have

\[ \frac{S_M^{nz} - \alpha a}{\sigma} = \frac{S_M - \alpha a}{\sigma} + \frac{Y - X_1}{\sigma} = W_M + \frac{Y - X_1}{\sigma} =: W^*. \tag{4.6} \]

Thus, from (4.6) and (4.5) we conclude that

\[ T_1 = \frac{c^2\alpha}{\sigma^2} E[f'(W^*)] \tag{4.7} \]

and

\[ E_1 = \frac{c^2\alpha}{\sigma^2} E[f'(W) - f'(W^*)]. \tag{4.8} \]

We would like to mention that if $a = 0$, then, by (4.7), $W^*$ has the $W$-zero biased distribution as $T_2 = 0$ and $\sigma^2 = c^2\alpha$ in this case. Before addressing $T_2$, we remark that the random variables appearing in $E_1$ and $E_2$, respectively, could possibly be defined on different probability spaces, if convenient, since they do not appear under the same expectation sign. Indeed, for $E_2$ we use the coupling $(N, N^s)$, which is given in the statements of Theorems 2.5 and 2.7 and which appears in the bounds via the difference $D = N^s - N$. In order to manipulate $E_2$ we thus assume that the random variables

\[ (N, N^s), X_1, X_2, \ldots \]

are independent and that $N^s$ has the $N$-size biased distribution. Note that we do not assume here that $D = N^s - N \ge 0$, since sometimes a natural coupling yielding a small value of $|D|$ does not satisfy this nonnegativity condition. In what follows we will use the notation

\[ V := W_{N^s} - W_N = \frac{1}{\sigma}(S_{N^s} - S_N) \quad\text{and}\quad J := 1_{\{D \ge 0\}} = 1_{\{N^s \ge N\}}. \]

Now we turn to rewriting $T_2$. Using the independence of $N$ and $X_1, X_2, \ldots$, and that of $N^s$ and $X_1, X_2, \ldots$, respectively, $E[N] = \alpha$ and the defining equation (2.1) of the $N$-size biased distribution, we obtain from (4.2) that

\[ T_2 = \frac{a}{\sigma} E[(N - \alpha)f(W_N)] = \frac{\alpha a}{\sigma} E[f(W_{N^s}) - f(W_N)] \]
\[ = \frac{\alpha a}{\sigma} E\bigl[ 1_{\{N^s \ge N\}} (f(W_{N^s}) - f(W_N)) \bigr] + \frac{\alpha a}{\sigma} E\bigl[ 1_{\{N^s < N\}} (f(W_{N^s}) - f(W_N)) \bigr] \]
\[ = \frac{\alpha a}{\sigma} E\bigl[ J(f(W_N + V) - f(W_N)) \bigr] - \frac{\alpha a}{\sigma} E\bigl[ (1 - J)(f(W_{N^s} - V) - f(W_{N^s})) \bigr] \]
\[ = \frac{\alpha a}{\sigma} E[JVf'(W_N)] + \frac{\alpha a}{\sigma} E[JR_f(W_N, V)] \]
\[ \quad + \frac{\alpha a}{\sigma} E[(1 - J)Vf'(W_{N^s})] - \frac{\alpha a}{\sigma} E[(1 - J)R_f(W_{N^s}, -V)], \tag{4.9} \]

where $R_f$ was defined in (3.5). Note that we have

\[ JV = 1_{\{N^s \ge N\}} \frac{1}{\sigma} \sum_{j=N+1}^{N^s} X_j \quad\text{and}\quad W_N = \frac{\sum_{j=1}^N X_j - \alpha a}{\sigma} \]

and, hence, the random variables $JV$ and $W_N$ are conditionally independent given $N$. Noting also that

\[ E[JV \mid N] = \frac{1}{\sigma} E\Bigl[ J \sum_{j=N+1}^{N^s} X_j \Bigm| N \Bigr] = \frac{1}{\sigma} E\Bigl[ J E\Bigl[ \sum_{j=N+1}^{N^s} X_j \Bigm| N, N^s \Bigr] \Bigm| N \Bigr] = \frac{a}{\sigma} E[JD \mid N] \]

we obtain that

\[ \frac{\alpha a}{\sigma} E[JVf'(W_N)] = \frac{\alpha a}{\sigma} E\bigl[ E[JV \mid N] E[f'(W_N) \mid N] \bigr] = \frac{\alpha a^2}{\sigma^2} E\bigl[ E[JD \mid N] E[f'(W_N) \mid N] \bigr] \]
\[ = \frac{\alpha a^2}{\sigma^2} E\bigl[ E[JDf'(W_N) \mid N] \bigr] = \frac{\alpha a^2}{\sigma^2} E[JDf'(W_N)], \tag{4.10} \]

where we have used for the next to last equality that also $D$ and $W_N$ are conditionally independent given $N$. In a similar fashion, using that $W_{N^s}$ and $1_{\{D < 0\}}V$ and also $W_{N^s}$ and $D$ are conditionally independent given $N^s$, one can show

\[ \frac{\alpha a}{\sigma} E[(1 - J)Vf'(W_{N^s})] = \frac{\alpha a^2}{\sigma^2} E[(1 - J)Df'(W_{N^s})]. \tag{4.11} \]

Hence, using that

\[ \frac{\alpha a^2}{\sigma^2} E[D] = \frac{\alpha a^2}{\sigma^2} E[N^s - N] = \frac{\alpha a^2}{\sigma^2} \cdot \frac{\gamma^2}{\alpha} = \frac{a^2\gamma^2}{\sigma^2} \]

from (4.3), (4.9), (4.10) and (4.11) we obtain

\[ E_2 = \frac{\alpha a^2}{\sigma^2} E\bigl[ (E[D] - D)f'(W_N) \bigr] + \frac{\alpha a^2}{\sigma^2} E\bigl[ (1 - J)D(f'(W_N) - f'(W_{N^s})) \bigr] \]
\[ \quad - \frac{\alpha a}{\sigma} E[JR_f(W_N, V)] + \frac{\alpha a}{\sigma} E[(1 - J)R_f(W_{N^s}, -V)] \]
\[ =: E_{2,1} + E_{2,2} + E_{2,3} + E_{2,4}. \tag{4.12} \]

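Using the conditional independence of $D$ and $W_N$ given $N$ as well as the Cauchy-Schwarz inequality, we can estimate

\[ |E_{2,1}| = \frac{\alpha a^2}{\sigma^2} \Bigl| E\bigl[ E[D - E[D] \mid N]\, E[f'(W_N) \mid N] \bigr] \Bigr| \le \frac{\alpha a^2}{\sigma^2} \sqrt{\mathrm{Var}(E[D \mid N])} \sqrt{E\bigl[ (E[f'(W_N) \mid N])^2 \bigr]} \]
\[ \le \frac{\alpha a^2}{\sigma^2} \|f'\|_\infty \sqrt{\mathrm{Var}(E[D \mid N])}. \tag{4.13} \]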


Now we will proceed by first assuming that $h$ is a 1-Lipschitz function. In this case, we choose the coupling $(M, N)$ used for $E_1$ in such a way that $M \ge N$. By Remark 2.2 (a) such a construction of $(M, N)$ is always possible, e.g. via the quantile transformation, and it achieves the Wasserstein distance, i.e.

\[ E|M - N| = E[M - N] = \frac{E[N^2]}{E[N]} - E[N] = \frac{\mathrm{Var}(N)}{E[N]} = \frac{\gamma^2}{\alpha} = d_W(N, N^s). \]

In order to bound $E_1$, we first derive an estimate for $E|W_M - W_N|$. We have

\[ E[|W_M - W_N| \mid N, M] = \frac{1}{\sigma} E[|S_M - S_N| \mid N, M] \le \frac{M - N}{\sigma} E|X_1| \le \frac{b(M - N)}{\sigma} \tag{4.14} \]

and, hence,

\[ E|W_M - W_N| = E\bigl[ E[|W_M - W_N| \mid N, M] \bigr] = \frac{1}{\sigma} E\bigl[ E[|S_M - S_N| \mid N, M] \bigr] \le \frac{b}{\sigma} E[M - N] = \frac{b\gamma^2}{\sigma\alpha}. \tag{4.15} \]

Then, using (3.4), (4.15) as well as the fact that the $X_j$ are i.i.d., for $E_1$ we obtain that

\[ |E_1| = \frac{c^2\alpha}{\sigma^2} \Bigl| E\Bigl[ f'(W_N) - f'\Bigl( W_M + \frac{Y - X_1}{\sigma} \Bigr) \Bigr] \Bigr| \]
\[ \le \frac{2c^2\alpha}{\sigma^2} \bigl( E|W_N - W_M| + \sigma^{-1} E|Y - X_1| \bigr) \tag{4.16} \]
\[ \le \frac{2c^2 b\gamma^2}{\sigma^3} + \frac{3\alpha d^3}{\sigma^3}. \tag{4.17} \]

Here, we have used the inequality

\[ E|Y - X_1| = E\bigl| Y - E[X_1] - (X_1 - E[X_1]) \bigr| \le \frac{3}{2} \cdot \frac{E|X_1 - E[X_1]|^3}{\mathrm{Var}(X_1)} = \frac{3d^3}{2c^2}, \tag{4.18} \]

which follows from an analogous one in the zero-bias framework (see [CGS11]) via the fact that $Y - E[X_1]$ has the $(X_1 - E[X_1])$-zero biased distribution.

Similarly to (4.14) we obtain

\[ E[|V| \mid N, N^s] = E[|W_{N^s} - W_N| \mid N, N^s] \le \frac{b|N^s - N|}{\sigma} = \frac{b|D|}{\sigma}, \]

which, together with (3.4) yields that

\[ |E_{2,2}| = \frac{\alpha a^2}{\sigma^2} \Bigl| E\bigl[ (1 - J)D(f'(W_N) - f'(W_{N^s})) \bigr] \Bigr| \le \frac{2\alpha a^2}{\sigma^2} E\bigl| (1 - J)D(W_N - W_{N^s}) \bigr| \]
\[ = \frac{2\alpha a^2}{\sigma^2} E\bigl[ (1 - J)|D|\, E[|V| \mid N, N^s] \bigr] \le \frac{2\alpha a^2 b}{\sigma^3} E[(1 - J)D^2]. \tag{4.19} \]


We conclude the proof of the Wasserstein bounds by estimating $E_{2,3}$ and $E_{2,4}$. Note that by (3.6) we have

\[ |R_f(W_N, V)| \le V^2 \quad\text{and}\quad |R_f(W_{N^s}, -V)| \le V^2, \]

yielding

\[ |E_{2,3}| + |E_{2,4}| \le \frac{\alpha|a|}{\sigma} E\bigl[ (1_{\{D \ge 0\}} + 1_{\{D < 0\}})V^2 \bigr] = \frac{\alpha|a|}{\sigma} E[V^2]. \tag{4.20} \]

Observe that

\[ E[V^2] = \frac{1}{\sigma^2} E\bigl[ (S_{N^s} - S_N)^2 \bigr] = \frac{1}{\sigma^2} \Bigl( \mathrm{Var}(S_{N^s} - S_N) + \bigl( E[S_{N^s} - S_N] \bigr)^2 \Bigr) \tag{4.21} \]

and

\[ E[S_{N^s} - S_N] = E\bigl[ E[S_{N^s} - S_N \mid N, N^s] \bigr] = aE[D] = \frac{a\gamma^2}{\alpha}. \tag{4.22} \]

Further, from the variance decomposition formula we obtain

\[ \mathrm{Var}(S_{N^s} - S_N) = E\bigl[ \mathrm{Var}(S_{N^s} - S_N \mid N, N^s) \bigr] + \mathrm{Var}\bigl( E[S_{N^s} - S_N \mid N, N^s] \bigr) \]
\[ = E[c^2|D|] + \mathrm{Var}(aD) = c^2 E|D| + a^2 \mathrm{Var}(D). \]

This together with (4.21) and (4.22) yields the bounds

\[ E[V^2] = E\bigl[ (W_{N^s} - W_N)^2 \bigr] = \frac{1}{\sigma^2} \bigl( c^2 E|D| + a^2 E[D^2] \bigr) \tag{4.23} \]
\[ \le \frac{b^2}{\sigma^2} E[D^2], \tag{4.24} \]

where we have used the fact that $D^2 \ge |D|$ and $a^2 + c^2 = b^2$ to obtain

\[ c^2 E|D| + a^2 E[D^2] \le b^2 E[D^2]. \]

The asserted bound on the Wasserstein distance between $W$ and $Z$ from Theorem 2.5 now follows from (3.4), (4.3), (4.12), (4.13), (4.17), (4.19), (4.20) and (4.24).

If $a = 0$, then $E_1$ can be bounded more accurately than we did before. Indeed, using (4.23) with $N^s = M$ and applying the Cauchy-Schwarz inequality give

\[ E|W_M - W_N| \le \sqrt{E\bigl[ (W_M - W_N)^2 \bigr]} = \frac{c}{\sigma} \sqrt{E|M - N|} = \frac{c\gamma}{\sigma\sqrt{\alpha}}, \]

as $c = b$ in this case. Plugging this into (4.16), we obtain

\[ |E_1| \le \frac{2c^2\alpha}{\sigma^2} \bigl( E|W_M - W_N| + \sigma^{-1} E|Y - X_1| \bigr) \le \frac{2c^3\gamma\sqrt{\alpha}}{c^3\alpha^{3/2}} + \frac{3\alpha d^3}{c^3\alpha^{3/2}} = \frac{2\gamma}{\alpha} + \frac{3d^3}{c^3\sqrt{\alpha}}, \]

which is the Wasserstein bound claimed in Theorem 2.7.


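Next, we proceed to the proof of the Berry-Esseen bounds in Theorems 2.5 and 2.7. Bounding the quantities $E_1$, $E_{2,2}$, $E_{2,3}$ and $E_{2,4}$ in the case that $h = h_z$ is much more technically involved. Also, in this case we do not in general profit from choosing $M$ appearing in $T_1$ in such a way that $M \ge N$. This is why we let $M = N^s$ for the proof of the Kolmogorov bound in Theorem 2.5. Only for the proof of Theorem 2.7 we will later assume that $M \ge N$. We write $f = f_z$ and introduce the notation

\[ \tilde{V} := W^* - W = W_{N^s} + \sigma^{-1}(Y - X_1) - W_N = V + \sigma^{-1}(Y - X_1). \]

From (4.8) and the fact that $f$ solves the Stein equation (3.1) for $h = h_z$ we have

\[ E_1 = \frac{c^2\alpha}{\sigma^2} E[f'(W) - f'(W^*)] \]
\[ = \frac{c^2\alpha}{\sigma^2} E[Wf(W) - W^*f(W^*)] + \frac{c^2\alpha}{\sigma^2} \bigl( P(W \le z) - P(W^* \le z) \bigr) \]
\[ =: E_{1,1} + E_{1,2}. \tag{4.25} \]

In order to bound $E_{1,1}$ we apply (3.8) to obtain

\[ |E_{1,1}| \le \frac{c^2\alpha}{\sigma^2} E\Bigl[ |\tilde{V}| \Bigl( \frac{\sqrt{2\pi}}{4} + |W| \Bigr) \Bigr]. \tag{4.26} \]

Using (4.23), (4.24) and (4.18) we have

\[ E|\tilde{V}| \le E|V| + \sigma^{-1} E|Y - X_1| \le \sqrt{E[V^2]} + \frac{3d^3}{2\sigma c^2} \]
\[ = \frac{1}{\sigma} \sqrt{c^2 E|D| + a^2 E[D^2]} + \frac{3d^3}{2\sigma c^2} \tag{4.27} \]
\[ \le \frac{b}{\sigma} \sqrt{E[D^2]} + \frac{3d^3}{2\sigma c^2}. \tag{4.28} \]

Furthermore, using independence of $W$ and $Y$, we have

\[ E|(Y - X_1)W| \le E|(Y - E[X_1])W| + E|(X_1 - E[X_1])W| \]
\[ = E|Y - E[X_1]|\, E|W| + E|(X_1 - E[X_1])W| \]
\[ \le \frac{d^3}{2c^2} \sqrt{E[W^2]} + \sqrt{\mathrm{Var}(X_1) E[W^2]} = \frac{d^3}{2c^2} + c. \tag{4.29} \]

Finally, we have

\[ E|VW| \le \sqrt{E[V^2]} \sqrt{E[W^2]} = \frac{1}{\sigma} \sqrt{c^2 E|D| + a^2 E[D^2]} \tag{4.30} \]
\[ \le \frac{b}{\sigma} \sqrt{E[D^2]}. \tag{4.31} \]

From (4.26), (4.27), (4.28), (4.29), (4.30) and (4.31) we conclude that

\[ |E_{1,1}| \le \frac{c^2\alpha}{\sigma^2} \Bigl( \frac{\sqrt{2\pi}}{4\sigma} \sqrt{c^2 E|D| + a^2 E[D^2]} + \frac{3\sqrt{2\pi}\, d^3}{8c^2\sigma} + \frac{d^3}{2c^2\sigma} + \frac{c}{\sigma} + \frac{1}{\sigma} \sqrt{c^2 E|D| + a^2 E[D^2]} \Bigr) \]
\[ = \frac{c^2\alpha(\sqrt{2\pi} + 4)}{4\sigma^3} \sqrt{c^2 E|D| + a^2 E[D^2]} + \frac{d^3\alpha(3\sqrt{2\pi} + 4)}{8\sigma^3} + \frac{c^3\alpha}{\sigma^3} \tag{4.32} \]
\[ \le \frac{(\sqrt{2\pi} + 4)bc^2\alpha}{4\sigma^3} \sqrt{E[D^2]} + \frac{d^3\alpha(3\sqrt{2\pi} + 4)}{8\sigma^3} + \frac{c^3\alpha}{\sigma^3} =: B_1. \tag{4.33} \]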


In order to bound $E_{1,2}$ we need the following lemma, which will be proved in Section 5. In the following we denote by $C_K$ the Berry-Esseen constant for sums of i.i.d. random variables with finite third moment. It is known from [She11] that

\[ C_K \le 0.4748. \]

In particular, $2C_K \le 1$, and this upper bound 1 is substituted for $2C_K$ in the statements of Theorems 2.5 and 2.7. However, we prefer keeping the dependence of the bounds on $C_K$ explicit within the proof.


LEMMA 4.1. With the above assumptions and notation we have for all $z \in \mathbb{R}$

(4.34) 

\P(W* < z) ~P(W N s < z)\ < -^=(-V2 + 2)ir and 
' 1 y/a V 2 / c 6 

| P(W N , <z)~ P{W < z)\ < P{N = 0) + ^=E[Dl {Dm N- l / 2 l {N > l} \ 

c\Z2tt 

H-®[ 1 {b>o>^ 1 / 2 1 (w>i)] 

(435) + + - 

If a = 0 and D > 0, then for all z G M 

9P d 3 

| P{W N . <z)~ P(W <z) | < P(N = 0) + — E[N-^ 2 1 {n > 1} ] 

(4.36) + -L E [v / ZlA^- 1 / 2 l { Ar> 1} ] 

V 2tt 

(4.37) < m = 0) + . 


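Applying the triangle inequality to Lemma 4.1 yields the following bounds on $E_{1,2}$. In the most general situation (Theorem 2.5 and Remark 2.6 (b)) we have

\begin{align*}
|E_{1,2}| &\le \Bigl(\frac{7}{2}\sqrt{2} + 2\Bigr)\frac{\sqrt{\alpha}\, d^3}{c\sigma^2} + \frac{c^2\alpha}{\sigma^2} P(N = 0) + \frac{c\alpha b}{\sigma^2\sqrt{2\pi}}\, E\bigl[D 1_{\{D \ge 0\}} N^{-1/2} 1_{\{N \ge 1\}}\bigr] \\
&\quad + \frac{2C_{\mathcal{K}} d^3\alpha}{c\sigma^2}\, E\bigl[1_{\{D \ge 0\}} N^{-1/2} 1_{\{N \ge 1\}}\bigr] + \frac{c b\sqrt{\alpha}}{\sigma^2\sqrt{2\pi}}\sqrt{E\bigl[D^2 1_{\{D < 0\}}\bigr]} \\
&\quad + \frac{2C_{\mathcal{K}} d^3\sqrt{\alpha}}{c\sigma^2}\sqrt{P(D < 0)} =: B_2. \tag{4.38}
\end{align*}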


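If $a = 0$ and $D \ge 0$, then, keeping in mind that $\sigma^2 = \alpha c^2$ in this case,

\begin{align*}
|E_{1,2}| &\le \Bigl(\frac{7}{2}\sqrt{2} + 2\Bigr)\frac{d^3}{c^3\sqrt{\alpha}} + P(N = 0) + \frac{2C_{\mathcal{K}} d^3}{c^3}\, E\bigl[N^{-1/2} 1_{\{N \ge 1\}}\bigr] + \frac{1}{\sqrt{2\pi}}\, E\bigl[\sqrt{D}\, N^{-1/2} 1_{\{N \ge 1\}}\bigr] \\
&\le \Bigl(\frac{7}{2}\sqrt{2} + 2\Bigr)\frac{d^3}{c^3\sqrt{\alpha}} + P(N = 0) + \dots \tag{4.39}
\end{align*}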


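The following lemma, which is also proved in Section 5, will be needed to bound the quantities $E_{2,2}$, $E_{2,3}$ and $E_{2,4}$ from (4.12).

Lemma 4.2. With the above assumptions and notation we have

\begin{align*}
E\bigl[J|V| 1_{\{z - (V \vee 0) < W \le z - (V \wedge 0)\}}\bigr] &\le \frac{b}{\sigma}\sqrt{P(N = 0)}\sqrt{E[JD^2]} + \frac{b^2}{c\sigma\sqrt{2\pi}}\, E\bigl[JD^2 1_{\{N \ge 1\}} N^{-1/2}\bigr] \\
&\quad + \frac{2C_{\mathcal{K}} d^3 b}{c^3\sigma}\, E\bigl[JD 1_{\{N \ge 1\}} N^{-1/2}\bigr], \tag{4.40}\\
E\bigl[(1 - J)|V| 1_{\{z + (V \wedge 0) < W_{N^s} \le z + (V \vee 0)\}}\bigr] &\le \frac{b^2}{c\sigma\sqrt{2\pi}}\, E\bigl[(1 - J)D^2 (N^s)^{-1/2}\bigr] \\
&\quad + \frac{2C_{\mathcal{K}} d^3 b}{c^3\sigma\sqrt{\alpha}}\sqrt{E[(1 - J)D^2]} \quad\text{and} \tag{4.41}\\
E\bigl[(1 - J)|D| 1_{\{z + (V \wedge 0) < W_{N^s} \le z + (V \vee 0)\}}\bigr] &\le \frac{b}{c\sqrt{2\pi}}\, E\bigl[(1 - J)D^2 (N^s)^{-1/2}\bigr] \\
&\quad + \frac{2C_{\mathcal{K}} d^3}{c^3\sqrt{\alpha}}\sqrt{E[(1 - J)D^2]}. \tag{4.42}
\end{align*}

Next, we derive a bound on $E_{2,2}$. Since $f$ solves the Stein equation (3.1) for $h = h_z$ we have

\begin{align*}
E_{2,2} &= \frac{\alpha a^2}{\sigma^2}\, E\bigl[(1 - J)D\bigl(W_N f(W_N) - W_{N^s} f(W_{N^s})\bigr)\bigr] \\
&\quad + \frac{\alpha a^2}{\sigma^2}\, E\bigl[(1 - J)D\bigl(1_{\{W_N \le z\}} - 1_{\{W_{N^s} \le z\}}\bigr)\bigr] \\
&=: E_{2,2,1} + E_{2,2,2}. \tag{4.43}
\end{align*}

Using

\[
W_N = W_{N^s} - V
\]

and Lemma 5.1 we obtain from (4.42) that

\begin{align*}
|E_{2,2,2}| &\le \frac{\alpha a^2}{\sigma^2}\, E\bigl[1_{\{D < 0\}}|D| 1_{\{z + (V \wedge 0) < W_{N^s} \le z + (V \vee 0)\}}\bigr] \\
&\le \frac{\alpha a^2 b}{\sigma^2 c\sqrt{2\pi}}\, E\bigl[1_{\{D < 0\}} D^2 (N^s)^{-1/2}\bigr] + \frac{2C_{\mathcal{K}} d^3\sqrt{\alpha}\, a^2}{c^3\sigma^2}\sqrt{E\bigl[1_{\{D < 0\}} D^2\bigr]} =: B_4. \tag{4.44}
\end{align*}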

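As to $E_{2,2,1}$, from (3.8) we have

\[
|E_{2,2,1}| \le \frac{\alpha a^2}{\sigma^2}\, E\Bigl[(1 - J)|DV|\Bigl(|W_{N^s}| + \frac{\sqrt{2\pi}}{4}\Bigr)\Bigr]. \tag{4.45}
\]

As

\[
E\bigl[|V| \mid N, N^s\bigr] \le \sqrt{E\bigl[V^2 \mid N, N^s\bigr]} = \frac{1}{\sigma}\sqrt{c^2|D| + a^2 D^2} \le \frac{b}{\sigma}|D|, \tag{4.46}
\]

by conditioning, we see

\[
E\bigl[(1 - J)|DV|\bigr] = E\bigl[(1 - J)|D|\, E[|V| \mid N, N^s]\bigr] \le \frac{b}{\sigma}\, E\bigl[(1 - J)D^2\bigr]. \tag{4.47}
\]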
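Now, using the fact that conditionally on $N^s$, the random variables $W_{N^s}$ and $(1 - J)|DV|$ are independent, as well as the Cauchy-Schwarz inequality, we conclude that

\begin{align*}
E\bigl|(1 - J)DVW_{N^s}\bigr| &= E\bigl[E\bigl[|(1 - J)DVW_{N^s}| \mid N^s\bigr]\bigr] \\
&= E\bigl[E\bigl[(1 - J)|DV| \mid N^s\bigr]\, E\bigl[|W_{N^s}| \mid N^s\bigr]\bigr] \\
&\le \sqrt{E\bigl[\bigl(E[(1 - J)|DV| \mid N^s]\bigr)^2\bigr]}\sqrt{E\bigl[\bigl(E[|W_{N^s}| \mid N^s]\bigr)^2\bigr]} \\
&\le \frac{b}{\sigma}\sqrt{E\bigl[\bigl(E[(1 - J)D^2 \mid N^s]\bigr)^2\bigr]}\sqrt{E\bigl[W_{N^s}^2\bigr]}, \tag{4.48}
\end{align*}

where we have used the conditional Jensen inequality, (4.46) and

\[
E\bigl[(1 - J)|DV| \mid N^s\bigr] = E\bigl[(1 - J)|D|\, E[|V| \mid N, N^s] \mid N^s\bigr]
\]

to obtain the last inequality. Using the defining relation (2.1) of the size-biased distribution one can easily show that

\[
E\bigl[W_{N^s}^2\bigr] = \frac{1}{\sigma^2}\, E\bigl[c^2 N^s + a^2 (N^s - \alpha)^2\bigr] = \frac{c^2\beta^2 + a^2(\delta^3 - 2\alpha\beta^2 + \alpha^3)}{\alpha\sigma^2}, \tag{4.49}
\]

which, together with (4.45), (4.47) and (4.48), yields that

\begin{align*}
|E_{2,2,1}| &\le \frac{\alpha a^2 b\sqrt{2\pi}}{4\sigma^3}\, E\bigl[1_{\{D < 0\}} D^2\bigr] \\
&\quad + \frac{\alpha a^2 b}{\sigma^3}\Bigl(\frac{c^2\beta^2 + a^2(\delta^3 - 2\alpha\beta^2 + \alpha^3)}{\alpha\sigma^2}\Bigr)^{1/2}\sqrt{E\bigl[\bigl(E[1_{\{D < 0\}} D^2 \mid N^s]\bigr)^2\bigr]} =: B_5. \tag{4.50}
\end{align*}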
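The moment identity (4.49) can be checked exactly for a concrete index distribution. The sketch below assumes the standard size-bias relation $E[g(N^s)] = E[N g(N)]/E[N]$ for (2.1), the notation $\alpha = E[N]$, $\beta^2 = E[N^2]$, $\delta^3 = E[N^3]$, and $\sigma^2 = c^2\alpha + a^2\operatorname{Var}(N)$; these conventions are inferred from the surrounding formulas rather than restated here. The binomial law and the values of $a$ and $c^2$ are arbitrary illustrations, and rational arithmetic is used so that both sides agree exactly.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p):
    """Exact pmf of Bin(n, p) for a rational p."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]


def size_biased_pmf(pmf):
    """pmf of N^s under the assumed relation P(N^s = k) = k P(N = k) / E[N]."""
    alpha = sum(k * w for k, w in enumerate(pmf))
    return [k * w / alpha for k, w in enumerate(pmf)]


def both_sides_of_449(pmf, a, c2):
    """Return (E[c^2 N^s + a^2 (N^s - alpha)^2] / sigma^2, closed form from (4.49))."""
    alpha = sum(k * w for k, w in enumerate(pmf))
    beta2 = sum(k**2 * w for k, w in enumerate(pmf))
    delta3 = sum(k**3 * w for k, w in enumerate(pmf))
    sigma2 = c2 * alpha + a**2 * (beta2 - alpha**2)
    sb = size_biased_pmf(pmf)
    lhs = sum(w * (c2 * m + a**2 * (m - alpha) ** 2) for m, w in enumerate(sb)) / sigma2
    rhs = (c2 * beta2 + a**2 * (delta3 - 2 * alpha * beta2 + alpha**3)) / (alpha * sigma2)
    return lhs, rhs
```

For instance, `both_sides_of_449(binomial_pmf(12, Fraction(1, 3)), Fraction(2), Fraction(5, 4))` returns two equal fractions.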


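It remains to bound the quantities $E_{2,3}$ and $E_{2,4}$ from (4.12) for $f = f_z$. From (3.11) we have

\begin{align*}
|E_{2,3}| &= \frac{\alpha|a|}{\sigma}\bigl|E\bigl[1_{\{D \ge 0\}} R_f(W, V)\bigr]\bigr| \\
&\le \frac{\alpha|a|}{2\sigma}\, E\Bigl[JV^2\Bigl(|W| + \frac{\sqrt{2\pi}}{4}\Bigr)\Bigr] + \frac{\alpha|a|}{\sigma}\, E\bigl[J|V| 1_{\{z - (V \vee 0) < W \le z - (V \wedge 0)\}}\bigr] \\
&=: R_{1,1} + R_{1,2}. \tag{4.51}
\end{align*}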


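Similarly to (4.23) we obtain

\begin{align*}
E[JV^2] &= \frac{1}{\sigma^2}\bigl(c^2 E[JD] + a^2 E[JD^2]\bigr) \\
&\le \frac{b^2}{\sigma^2}\, E[JD^2] \tag{4.52}
\end{align*}

from

\begin{align*}
E\bigl[JV^2 \mid N, N^s\bigr] &= J\, E\bigl[V^2 \mid N, N^s\bigr] = \frac{J}{\sigma^2}\bigl(c^2|D| + a^2 D^2\bigr) \\
&= \frac{1}{\sigma^2}\bigl(c^2 JD + a^2 JD^2\bigr). \tag{4.53}
\end{align*}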
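Also, recall that the random variables

\[
JV^2 = \sigma^{-2} 1_{\{N^s \ge N\}}\Bigl(\sum_{j=N+1}^{N^s} X_j\Bigr)^2 \quad\text{and}\quad W_N = \sigma^{-1}\Bigl(\sum_{j=1}^{N} X_j - \alpha a\Bigr)
\]

are conditionally independent given $N$. Hence, using the Cauchy-Schwarz inequality,

\begin{align*}
E\bigl[JV^2 |W_N|\bigr] &= E\bigl[E\bigl[JV^2 |W_N| \mid N\bigr]\bigr] = E\bigl[E[JV^2 \mid N]\, E[|W_N| \mid N]\bigr] \\
&\le \sqrt{E\bigl[\bigl(E[JV^2 \mid N]\bigr)^2\bigr]}\sqrt{E\bigl[\bigl(E[|W_N| \mid N]\bigr)^2\bigr]}. \tag{4.54}
\end{align*}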


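From (4.53) and $D^2 \ge |D|$ we conclude that

\[
E\bigl[JV^2 \mid N\bigr] = \frac{1}{\sigma^2}\bigl(c^2 E[JD \mid N] + a^2 E[JD^2 \mid N]\bigr) \le \frac{b^2}{\sigma^2}\, E\bigl[JD^2 \mid N\bigr]. \tag{4.55}
\]

Furthermore, by the conditional version of Jensen's inequality we have

\[
E\bigl[\bigl(E[|W_N| \mid N]\bigr)^2\bigr] \le E\bigl[E[W_N^2 \mid N]\bigr] = E[W^2] = 1. \tag{4.56}
\]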

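Thus, from (4.54), (4.55) and (4.56) we see that

\[
E\bigl[JV^2 |W_N|\bigr] \le \frac{b^2}{\sigma^2}\sqrt{E\bigl[\bigl(E[JD^2 \mid N]\bigr)^2\bigr]}. \tag{4.57}
\]

Hence, (4.51), (4.52) and (4.57) yield

\[
R_{1,1} \le \frac{\alpha|a| b^2}{2\sigma^3}\sqrt{E\bigl[\bigl(E[JD^2 \mid N]\bigr)^2\bigr]} + \frac{\alpha|a| b^2\sqrt{2\pi}}{8\sigma^3}\, E[JD^2]. \tag{4.58}
\]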


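Finally, from (4.51), (4.58) and (4.40) we get

\begin{align*}
|E_{2,3}| &\le \frac{\alpha|a| b^2}{2\sigma^3}\sqrt{E\bigl[\bigl(E[1_{\{D \ge 0\}} D^2 \mid N]\bigr)^2\bigr]} + \frac{\alpha|a| b^2\sqrt{2\pi}}{8\sigma^3}\, E\bigl[1_{\{D \ge 0\}} D^2\bigr] \\
&\quad + \frac{\alpha|a| b}{\sigma^2}\sqrt{P(N = 0)}\sqrt{E\bigl[1_{\{D \ge 0\}} D^2\bigr]} + \frac{\alpha|a| b^2}{c\sigma^2\sqrt{2\pi}}\, E\bigl[1_{\{D \ge 0\}} D^2 1_{\{N \ge 1\}} N^{-1/2}\bigr] \\
&\quad + \frac{2C_{\mathcal{K}} d^3\alpha|a| b}{c^3\sigma^2}\, E\bigl[1_{\{D \ge 0\}} D 1_{\{N \ge 1\}} N^{-1/2}\bigr] =: B_6. \tag{4.59}
\end{align*}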

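Similarly, we have

\begin{align*}
|E_{2,4}| &= \frac{\alpha|a|}{\sigma}\bigl|E\bigl[1_{\{D < 0\}} R_f(W_{N^s}, -V)\bigr]\bigr| \\
&\le \frac{\alpha|a|}{2\sigma}\, E\Bigl[(1 - J)V^2\Bigl(|W_{N^s}| + \frac{\sqrt{2\pi}}{4}\Bigr)\Bigr] + \frac{\alpha|a|}{\sigma}\, E\bigl[(1 - J)|V| 1_{\{z + (V \wedge 0) < W_{N^s} \le z + (V \vee 0)\}}\bigr] \\
&=: R_{2,1} + R_{2,2}. \tag{4.60}
\end{align*}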
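Analogously to the above we obtain

\begin{align*}
E\bigl[(1 - J)V^2\bigr] &= \frac{1}{\sigma^2}\bigl(c^2 E[(1 - J)|D|] + a^2 E[(1 - J)D^2]\bigr) \\
&\le \frac{b^2}{\sigma^2}\, E\bigl[(1 - J)D^2\bigr] \quad\text{and} \tag{4.61}\\
E\bigl[(1 - J)V^2 \mid N^s\bigr] &= \frac{1}{\sigma^2}\bigl(c^2 E[(1 - J)|D| \mid N^s] + a^2 E[(1 - J)D^2 \mid N^s]\bigr) \\
&\le \frac{b^2}{\sigma^2}\, E\bigl[(1 - J)D^2 \mid N^s\bigr].
\end{align*}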

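Using these as well as the conditional independence of $(1 - J)V^2$ and $W_{N^s}$ given $N^s$, one has

\[
E\bigl[(1 - J)V^2 |W_{N^s}|\bigr] \le \frac{b^2}{\sigma^2}\sqrt{E\bigl[\bigl(E[(1 - J)D^2 \mid N^s]\bigr)^2\bigr]}\sqrt{E\bigl[W_{N^s}^2\bigr]}. \tag{4.62}
\]

Combining (4.49) and (4.62) we obtain

\[
E\bigl[(1 - J)V^2 |W_{N^s}|\bigr] \le \frac{b^2}{\sigma^2}\sqrt{E\bigl[\bigl(E[(1 - J)D^2 \mid N^s]\bigr)^2\bigr]}\Bigl(\frac{c^2\beta^2 + a^2(\delta^3 - 2\alpha\beta^2 + \alpha^3)}{\alpha\sigma^2}\Bigr)^{1/2}. \tag{4.63}
\]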


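Thus, from (4.60), (4.61), (4.63) and (4.41) we conclude

\begin{align*}
|E_{2,4}| &\le \frac{\alpha|a| b^2\sqrt{2\pi}}{8\sigma^3}\, E\bigl[1_{\{D < 0\}} D^2\bigr] \\
&\quad + \frac{\alpha|a| b^2}{2\sigma^3}\sqrt{E\bigl[\bigl(E[1_{\{D < 0\}} D^2 \mid N^s]\bigr)^2\bigr]}\Bigl(\frac{c^2\beta^2 + a^2(\delta^3 - 2\alpha\beta^2 + \alpha^3)}{\alpha\sigma^2}\Bigr)^{1/2} \\
&\quad + \frac{\alpha|a| b^2}{\sigma^2 c\sqrt{2\pi}}\, E\bigl[1_{\{D < 0\}} D^2 (N^s)^{-1/2}\bigr] + \frac{2C_{\mathcal{K}} d^3\sqrt{\alpha}\, |a| b}{c^3\sigma^2}\sqrt{E\bigl[1_{\{D < 0\}} D^2\bigr]} \\
&=: B_7. \tag{4.64}
\end{align*}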

The Berry-Esseen bound stated in Remark l2Tl (b) follows from (14.31) . (14.251) . (14.331) . 
(14381) . (14T2D . (14431) . (13401) . (14431) . (14441) . (14301) . (14591) and (14641) . This immedi¬ 
ately yields the Berry-Esseen bound presented in Theorem 12.51 (b) because 

By = B 4 = B§ = Bj = 0 

in this case. In order to obtain the Kolmogorov bound in Theorem 2.7, we again choose $M$ such that $M \ge N$ and use the bounds (4.32) and (4.39) instead. The result then follows from (4.3) and (4.25).

5. Proofs of auxiliary results 

Here, we give several rather technical proofs. We start with the following easy 
lemma, whose proof is omitted. 

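Lemma 5.1. For all $x, u, v, z \in \mathbb{R}$ we have

\[
1_{\{x + u \le z\}} - 1_{\{x + v \le z\}} = 1_{\{z - v < x \le z - u\}} - 1_{\{z - u < x \le z - v\}} \quad\text{and}
\]

\[
\bigl|1_{\{x + u \le z\}} - 1_{\{x + v \le z\}}\bigr| = 1_{\{z - u \vee v < x \le z - u \wedge v\}}.
\]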
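Although the proof of Lemma 5.1 is omitted, both identities are easy to confirm by exhaustive enumeration. The following sketch checks them on an integer grid, which avoids floating-point rounding when comparing $x + u \le z$ with $x \le z - u$; the function names and the grid range are illustrative choices.

```python
import itertools


def ind(condition):
    """Indicator of a boolean condition as 0 or 1."""
    return 1 if condition else 0


def lemma_5_1_holds(x, u, v, z):
    """Check both identities of Lemma 5.1 at a single point (x, u, v, z)."""
    lhs = ind(x + u <= z) - ind(x + v <= z)
    rhs = ind(z - v < x <= z - u) - ind(z - u < x <= z - v)
    abs_rhs = ind(z - max(u, v) < x <= z - min(u, v))
    return lhs == rhs and abs(lhs) == abs_rhs


def lemma_5_1_holds_on_grid(radius=3):
    """Check the lemma for all integer quadruples with entries in [-radius, radius]."""
    grid = range(-radius, radius + 1)
    return all(lemma_5_1_holds(x, u, v, z)
               for x, u, v, z in itertools.product(grid, repeat=4))
```

The grid check covers the cases $u < v$, $u = v$ and $u > v$, which are the only distinctions the identities depend on.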
Lemma 5.2 (Concentration inequality). For all real $t < u$ and for all $n \ge 1$ we have

\[
P(t \le W_n \le u) \le \frac{\sigma(u - t)}{c\sqrt{2\pi}\sqrt{n}} + \frac{2C_{\mathcal{K}} d^3}{c^3\sqrt{n}}.
\]

Proof. The proof uses the Berry-Esseen Theorem for sums of i.i.d. random variables with finite third moment as well as the following fact, whose proof is straightforward: For each real-valued random variable $X$ and for all real $r < s$ we have the bound

\[
P(r \le X \le s) \le \frac{s - r}{\sqrt{2\pi}} + 2 d_{\mathcal{K}}(X, Z). \tag{5.1}
\]

A similar result was used in [PR11] in the framework of exponential approximation. Now, for given $t < u$ and $n \ge 1$, by (5.1) and the Berry-Esseen Theorem we have

\begin{align*}
P(t \le W_n \le u) &= P\Bigl(\frac{\sigma t + a(\alpha - n)}{c\sqrt{n}} \le \frac{S_n - na}{c\sqrt{n}} \le \frac{\sigma u + a(\alpha - n)}{c\sqrt{n}}\Bigr) \\
&\le \frac{\sigma(u - t)}{c\sqrt{2\pi}\sqrt{n}} + \frac{2C_{\mathcal{K}} d^3}{c^3\sqrt{n}}.
\end{align*}

□
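For a concrete illustration of Lemma 5.2, the bound can be compared with exact interval probabilities when the summands are Bernoulli. This is a sketch under stated assumptions: $X_j \sim \mathrm{Bernoulli}(p)$, so $a = p$, $c^2 = p(1-p)$ and $d^3 = E|X_1 - p|^3$; $W_n = (S_n - \alpha a)/\sigma$ as in the proof above; the values of `alpha` and `sigma` are arbitrary placeholders for $E[N]$ and the standard deviation of the random sum; and `C_K` is set to the bound 0.4748 quoted from [She11].

```python
import math

C_K = 0.4748  # upper bound on the Berry-Esseen constant, as quoted in the text


def binomial_pmf(n, p):
    """pmf of S_n ~ Bin(n, p)."""
    return [math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]


def interval_probability(n, p, alpha, sigma, t, u):
    """Exact P(t <= W_n <= u) with W_n = (S_n - alpha * p) / sigma."""
    pmf = binomial_pmf(n, p)
    return sum(w for k, w in enumerate(pmf) if t <= (k - alpha * p) / sigma <= u)


def lemma_5_2_bound(n, p, sigma, t, u):
    """Right-hand side of Lemma 5.2 for Bernoulli(p) summands."""
    c = math.sqrt(p * (1.0 - p))
    d3 = p * (1.0 - p) * (p**2 + (1.0 - p) ** 2)  # E|X_1 - p|^3
    return (sigma * (u - t) / (c * math.sqrt(2.0 * math.pi * n))
            + 2.0 * C_K * d3 / (c**3 * math.sqrt(n)))
```

Note that the bound does not depend on `alpha`: shifting the centering only translates the interval, exactly as in the displayed computation in the proof.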

Remark 5.3. It is actually not strictly necessary to apply the Berry-Esseen Theorem in order to prove Lemma 5.2. Using known concentration results for sums of independent random variables like Proposition 3.1 from [CGS11], for instance, would yield a comparable result, albeit with worse constants.

In order to prove Lemma 4.1 we cite the following concentration inequality from [CGS11]:

Lemma 5.4. Let $Y_1, \dots, Y_n$ be independent mean zero random variables such that

\[
\sum_{j=1}^{n} E[Y_j^2] = 1 \quad\text{and}\quad \zeta := \sum_{j=1}^{n} E|Y_j|^3 < \infty,
\]

then with $S^{(i)} := \sum_{j \ne i} Y_j$ one has for all real $r < s$ and all $i = 1, \dots, n$ that

\[
P(r \le S^{(i)} \le s) \le \sqrt{2}(s - r) + 2(\sqrt{2} + 1)\zeta.
\]

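Proof of Lemma 4.1. We first prove (4.34). Define

\[
W_{N^s}^{(1)} := W_{N^s} - \sigma^{-1} X_1 = \frac{1}{\sigma}\Bigl(\sum_{j=2}^{N^s} X_j - \alpha a\Bigr)
\]

such that

\[
W_{N^s} = W_{N^s}^{(1)} + \sigma^{-1} X_1 \quad\text{and}\quad W^* = W_{N^s}^{(1)} + \sigma^{-1} Y.
\]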
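Then, using Lemma 5.1 we have

\begin{align*}
&|P(W^* \le z) - P(W_{N^s} \le z)| \\
&\quad = \bigl|P(W_{N^s}^{(1)} + \sigma^{-1} Y \le z) - P(W_{N^s}^{(1)} + \sigma^{-1} X_1 \le z)\bigr| \\
&\quad \le P\bigl(z - \sigma^{-1}(X_1 \vee Y) < W_{N^s}^{(1)} \le z - \sigma^{-1}(X_1 \wedge Y)\bigr) \\
&\quad = E\Bigl[P\Bigl(\frac{\sigma z - (X_1 \vee Y) + a(\alpha - N^s + 1)}{c\sqrt{N^s}} < \sum_{j=2}^{N^s}\frac{X_j - a}{c\sqrt{N^s}} \le \frac{\sigma z - (X_1 \wedge Y) + a(\alpha - N^s + 1)}{c\sqrt{N^s}} \Bigm| N^s\Bigr)\Bigr].
\end{align*}

Now note that conditionally on $N^s$ the random variables $W_{N^s}^{(1)}$ and $(X_1, Y)$ are independent and that the statement of Lemma 5.4 may be applied to the random variable in the middle term of the above conditional probability, giving the bound

\[
|P(W^* \le z) - P(W_{N^s} \le z)| \le E\Bigl[\frac{\sqrt{2}\,|Y - X_1|}{c\sqrt{N^s}} + \frac{2(\sqrt{2} + 1) d^3}{c^3\sqrt{N^s}}\Bigr]. \tag{5.2}
\]

Noting that $(X_1, Y)$ and $N^s$ are independent and using (4.18) again, we obtain

\[
E\Bigl[\frac{|Y - X_1|}{\sqrt{N^s}}\Bigr] \le \frac{3d^3}{2c^2}\, E\bigl[(N^s)^{-1/2}\bigr] \le \frac{3d^3}{2c^2\sqrt{\alpha}}, \tag{5.3}
\]

as

\[
E\bigl[(N^s)^{-1/2}\bigr] = \frac{E[\sqrt{N}]}{E[N]} \le \frac{\sqrt{E[N]}}{E[N]} = \frac{1}{\sqrt{\alpha}} \tag{5.4}
\]

by (2.1) and Jensen's inequality. From (5.2), (5.3) and (5.4) the bound (4.34) follows.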
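The size-bias computation behind (5.4) can be illustrated numerically. The sketch below assumes that (2.1) is the standard relation $E[g(N^s)] = E[N g(N)]/E[N]$, that is, $P(N^s = k) = k P(N = k)/E[N]$; the binomial index distribution and the function names are illustrative choices. It also reports $E[(N^s)^{-1}]$, which under the same relation equals $P(N \ge 1)/E[N]$ and hence never exceeds $1/\alpha$.

```python
import math


def binomial_pmf(n, p):
    """pmf of N ~ Bin(n, p)."""
    return [math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]


def size_biased_pmf(pmf):
    """pmf of N^s under the assumed relation P(N^s = k) = k P(N = k) / E[N]."""
    alpha = sum(k * w for k, w in enumerate(pmf))
    return [k * w / alpha for k, w in enumerate(pmf)]


def size_bias_moments(pmf):
    """Negative moments of N^s together with the quantities they are compared to."""
    alpha = sum(k * w for k, w in enumerate(pmf))
    sb = size_biased_pmf(pmf)
    return {
        "alpha": alpha,
        # E[(N^s)^{-1/2}] and the ratio E[sqrt(N)] / E[N] from (5.4)
        "inv_sqrt_sb": sum(w / math.sqrt(m) for m, w in enumerate(sb) if m >= 1),
        "sqrt_ratio": sum(math.sqrt(k) * w for k, w in enumerate(pmf)) / alpha,
        # E[(N^s)^{-1}] and P(N >= 1) / E[N]
        "inv_sb": sum(w / m for m, w in enumerate(sb) if m >= 1),
        "positive_ratio": (1.0 - pmf[0]) / alpha,
    }
```

For the combined constant in (4.34), the two contributions add up as $\sqrt{2} \cdot \tfrac{3}{2} + 2(\sqrt{2} + 1) = \tfrac{7}{2}\sqrt{2} + 2$.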
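Next we prove (4.35). Using Lemma 5.1 we obtain

\begin{align*}
|P(W_{N^s} \le z) - P(W \le z)| &= \bigl|E\bigl[J\bigl(1_{\{W + V \le z\}} - 1_{\{W \le z\}}\bigr)\bigr] - E\bigl[(1 - J)\bigl(1_{\{W_{N^s} - V \le z\}} - 1_{\{W_{N^s} \le z\}}\bigr)\bigr]\bigr| \\
&\le E\bigl[J 1_{\{z - (V \vee 0) < W \le z - (V \wedge 0)\}}\bigr] + E\bigl[(1 - J) 1_{\{z + (V \wedge 0) < W_{N^s} \le z + (V \vee 0)\}}\bigr] \\
&=: A_1 + A_2. \tag{5.5}
\end{align*}

To bound $A_1$ we write

\begin{align*}
A_1 &= \sum_{n=0}^{\infty} E\bigl[J 1_{\{N = n\}} 1_{\{z - (V \vee 0) < W \le z - (V \wedge 0)\}}\bigr] \\
&= \sum_{n=0}^{\infty} P\bigl(z - (V \vee 0) < W \le z - (V \wedge 0) \bigm| D \ge 0, N = n\bigr) \cdot P(D \ge 0, N = n). \tag{5.6}
\end{align*}

Now note that conditionally on the event that $D \ge 0$ and $N = n$ the random variables $W$ and $V$ are independent and

\[
\mathcal{L}(W \mid D \ge 0, N = n) = \mathcal{L}(W_n).
\]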
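Thus, using Lemma 5.2 we have for all $n \ge 1$:

\begin{align*}
&P\bigl(z - (V \vee 0) < W \le z - (V \wedge 0) \bigm| D \ge 0, N = n\bigr) \\
&\quad = P\bigl(z - (V \vee 0) < W_n \le z - (V \wedge 0) \bigm| D \ge 0, N = n\bigr) \\
&\quad \le E\Bigl[\frac{\sigma|V|}{c\sqrt{2\pi}\sqrt{n}} + \frac{2C_{\mathcal{K}} d^3}{c^3\sqrt{n}} \Bigm| D \ge 0, N = n\Bigr]. \tag{5.7}
\end{align*}

From (5.6) and (5.7) we thus have

\begin{align*}
A_1 &\le P(N = 0) + \sum_{n=1}^{\infty} E\Bigl[\frac{\sigma|V|}{c\sqrt{2\pi}\sqrt{n}} + \frac{2C_{\mathcal{K}} d^3}{c^3\sqrt{n}} \Bigm| D \ge 0, N = n\Bigr] P(D \ge 0, N = n) \\
&= P(N = 0) + \sum_{n=1}^{\infty} E\Bigl[1_{\{D \ge 0, N = n\}}\Bigl(\frac{\sigma|V|}{c\sqrt{2\pi}\sqrt{n}} + \frac{2C_{\mathcal{K}} d^3}{c^3\sqrt{n}}\Bigr)\Bigr] \\
&= P(N = 0) + E\Bigl[J 1_{\{N \ge 1\}}\Bigl(\frac{\sigma|V|}{c\sqrt{2\pi}\sqrt{N}} + \frac{2C_{\mathcal{K}} d^3}{c^3\sqrt{N}}\Bigr)\Bigr]. \tag{5.8}
\end{align*}

Now note that

\begin{align*}
E\bigl[J 1_{\{N \ge 1\}} |V| N^{-1/2}\bigr] &= E\bigl[J 1_{\{N \ge 1\}} N^{-1/2}\, E[|V| \mid N, N^s]\bigr] \\
&\le E\Bigl[J 1_{\{N \ge 1\}} N^{-1/2}\sqrt{E[V^2 \mid N, N^s]}\Bigr] \\
&= \frac{1}{\sigma}\, E\Bigl[J 1_{\{N \ge 1\}} N^{-1/2}\sqrt{c^2 D + a^2 D^2}\Bigr] \tag{5.9}\\
&\le \frac{b}{\sigma}\, E\bigl[JD 1_{\{N \ge 1\}} N^{-1/2}\bigr]. \tag{5.10}
\end{align*}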


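It remains to bound $A_2$. We may assume that $P(D < 0) > 0$ since otherwise $A_2 = 0$. Noting that $N^s \ge 1$ almost surely, similarly to (5.6) we obtain

\[
A_2 = \sum_{m=1}^{\infty} P\bigl(z + (V \wedge 0) < W_{N^s} \le z + (V \vee 0) \bigm| D < 0, N^s = m\bigr) \cdot P(D < 0, N^s = m).
\]

Now, using the fact that conditionally on the event $\{N^s = m\} \cap \{D < 0\}$ the random variables $W_{N^s}$ and $V$ are independent and

\[
\mathcal{L}(W_{N^s} \mid N^s = m, D < 0) = \mathcal{L}(W_m),
\]

in the same manner as (5.8) we find

\[
A_2 \le E\Bigl[(1 - J)\Bigl(\frac{\sigma|V|}{c\sqrt{2\pi}\sqrt{N^s}} + \frac{2C_{\mathcal{K}} d^3}{c^3\sqrt{N^s}}\Bigr)\Bigr]. \tag{5.11}
\]

Using (2.1) we have

\[
E\bigl[(N^s)^{-1}\bigr] \le \frac{1}{E[N]} = \frac{1}{\alpha}. \tag{5.12}
\]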
Thus, from the Cauchy-Schwarz inequality and (5.12) we obtain

\begin{align*}
E\Bigl[\frac{(1 - J)|V|}{\sqrt{N^s}}\Bigr] &\le \sqrt{E\bigl[(N^s)^{-1}\bigr]}\sqrt{E\bigl[(1 - J)V^2\bigr]} \\
&\le \frac{1}{\sigma\sqrt{\alpha}}\sqrt{c^2 E[|D|(1 - J)] + a^2 E[D^2(1 - J)]} \\
&\le \frac{b}{\sigma\sqrt{\alpha}}\sqrt{E[D^2(1 - J)]}. \tag{5.13}
\end{align*}

Similarly, we have

\[
E\Bigl[\frac{1 - J}{\sqrt{N^s}}\Bigr] \le \sqrt{P(D < 0)}\sqrt{E\bigl[(N^s)^{-1}\bigr]} \le \frac{\sqrt{P(D < 0)}}{\sqrt{\alpha}}. \tag{5.14}
\]

Thus, from (5.8), (5.10), (5.11), (5.13) and (5.14) we see that $A_1 + A_2$ is bounded from above by the right hand side of (4.35). Using (5.8) and (5.9) instead gives the bounds (4.36) and (4.37).

□


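Proof of Lemma 4.2. We only prove (4.40), the proofs of (4.41) and (4.42) being similar and easier. By the definition of conditional expectation given an event, we have

\begin{align*}
E\bigl[J|V| 1_{\{z - (V \vee 0) < W_N \le z - (V \wedge 0)\}}\bigr] &= \sum_{n=0}^{\infty} E\bigl[1_{\{N = n, D \ge 0\}} |V| 1_{\{z - (V \vee 0) < W_n \le z - (V \wedge 0)\}}\bigr] \le E\bigl[1_{\{N = 0\}} J|V|\bigr] \\
&\quad + \sum_{n=1}^{\infty} E\bigl[|V| 1_{\{z - (V \vee 0) < W_n \le z - (V \wedge 0)\}} \bigm| N = n, D \ge 0\bigr] \cdot P(N = n, D \ge 0). \tag{5.15}
\end{align*}

Now, for $n \ge 1$, using the fact that the random variables $W_N$ and $V$ are conditionally independent given the event $\{D \ge 0\} \cap \{N = n\}$, from Lemma 5.2 we infer that

\begin{align*}
&E\bigl[|V| 1_{\{z - (V \vee 0) < W_n \le z - (V \wedge 0)\}} \bigm| N = n, D \ge 0\bigr] \\
&\quad \le E\Bigl[|V|\Bigl(\frac{\sigma|V|}{c\sqrt{2\pi}\sqrt{n}} + \frac{2C_{\mathcal{K}} d^3}{c^3\sqrt{n}}\Bigr) \Bigm| N = n, D \ge 0\Bigr]. \tag{5.16}
\end{align*}

Combining (5.15) and (5.16) we get

\begin{align*}
E\bigl[J|V| 1_{\{z - (V \vee 0) < W_N \le z - (V \wedge 0)\}}\bigr] &\le E\bigl[1_{\{N = 0\}} J|V|\bigr] \\
&\quad + \sum_{n=1}^{\infty} E\Bigl[|V|\Bigl(\frac{\sigma|V|}{c\sqrt{2\pi}\sqrt{n}} + \frac{2C_{\mathcal{K}} d^3}{c^3\sqrt{n}}\Bigr) \Bigm| N = n, D \ge 0\Bigr] \cdot P(N = n, D \ge 0) \\
&= E\bigl[1_{\{N = 0\}} J|V|\bigr] + E\Bigl[1_{\{N \ge 1\}} J|V|\Bigl(\frac{\sigma|V|}{c\sqrt{2\pi}\sqrt{N}} + \frac{2C_{\mathcal{K}} d^3}{c^3\sqrt{N}}\Bigr)\Bigr]. \tag{5.17}
\end{align*}

Using Cauchy-Schwarz as well as

\[
E[JV^2] = E\bigl[J\, E[V^2 \mid N, N^s]\bigr] \le \frac{b^2}{\sigma^2}\, E[JD^2]
\]

we obtain

\[
E\bigl[1_{\{N = 0\}} J|V|\bigr] \le \frac{b}{\sigma}\sqrt{P(N = 0)}\sqrt{E[JD^2]}. \tag{5.18}
\]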
Analogously to (5.10) one can show that

\[
E\bigl[1_{\{N \ge 1\}} N^{-1/2} JV^2\bigr] \le \frac{b^2}{\sigma^2}\, E\bigl[1_{\{N \ge 1\}} N^{-1/2} JD^2\bigr]. \tag{5.19}
\]

Hence, bound (4.40) follows from (5.17), (5.18), (5.10) and (5.19).

□